(54) Title: A SYSTEM FOR PREDICTING FUTURE HEALTH
(57) Abstract 

A computer-based system is disclosed for predicting future health of individuals comprising: (a) a computer comprising a processor
containing a database of longitudinally-acquired biomarker values from individual members of a test population, subpopulation D of
said members being identified as having acquired a specified biological condition within a specified time period or age interval and a
subpopulation D̄ being identified as not having acquired the specified biological condition within the specified time period or age interval;
and (b) a computer program that includes steps for: (1) selecting from said biomarkers a subset of biomarkers for discriminating between
members belonging to the subpopulations D and D̄, wherein the subset of biomarkers is selected based on distributions of the biomarker
values of the individual members of the test population; and (2) using the distributions of the selected biomarkers to develop a statistical
procedure that is capable of being used for: (i) classifying members of the test population as belonging within a subpopulation PD having
a prescribed high probability of acquiring the specified biological condition within the specified time period or age interval or as belonging
within a subpopulation FD having a prescribed low probability of acquiring the specified biological condition within the specified time
period or age interval; or (ii) estimating quantitatively, for each member of the test population, the probability of acquiring the specified
biological condition within the specified time period or age interval.
A SYSTEM FOR PREDICTING FUTURE HEALTH
FIELD OF INVENTION 

A computer-based system and method are disclosed for predicting the future health of an 
individual. More particularly, the present invention predicts the future health of an individual 
by obtaining longitudinal data for a large number of biomarkers from a large human test 
population, statistically selecting predictive biomarkers, and determining and assessing an 
appropriate multivariate evaluation function based upon the selected biomarkers. 

BACKGROUND OF THE INVENTION 

It would be desirable if the onset of future health problems could be predicted for an
individual with sufficient reliability far enough into the future so that the chances could be 
increased for preventing future health problems for that individual rather than waiting for 
actual onset of a disease and then treating the symptoms. At present, the overwhelming 
fraction of medical research funding is directed toward improving methods of diagnosis and 
treatment of disease rather than toward discovering preventive measures that could be 
directed toward reducing the risk of disease long before any of the typically observed 
symptoms of the disease are evident. Although the emphasis on treatment of diseases may 
have led to enormous advances in the medical sciences in terms of the large number and great 
sophistication of the techniques and methods developed for diagnosing existing diseases as 
well as for treating the diseases after diagnosis, such advances continue to lead to ever- 
increasing costs for treatment. Such costs can have staggering financial consequences for 
individuals as well as for the entire society. Such staggering costs have led to increasing 
public pressure to find ways of reducing medical costs. 

Thus, in addition to the benefit to be gained by an individual who could be informed of the 
high risk of the onset of disease far enough in advance so that effective preventive steps could 
be taken, substantial reductions in overall medical costs might be realized by entire 
communities and/or countries. 

Until now, two of the problems inherent in attempting to assess or predict an individual's
future health are: (a) such predictions are imprecise because they are based on data obtained
from relatively small study samples, consisting of a few hundred or even a few thousand
subjects, and (b) the predictions require extrapolation to individual persons from the mean
(and other parameters) of that sample. Such extrapolations are highly problematic with
respect to reliably estimating the risk of a specific individual, even within a group at high risk
for a specific disease. This is true, in part, because the statistical procedures that are typically
used are designed to make inferences about population means, not about individual members
of the population.

To obtain quantitative predictions, an "individual's future health" must be designated as the 
occurrence of a specific event within a specified timeframe. Two examples are: (a) 
occurrence of a myocardial infarction within the succeeding five years, (b) the individual's 
death within the next year. Predictions of such events are necessarily probabilistic in nature. 

Two types of probability are important in this context. The a priori probability of an event is
the probability of the event, before the fact of the event's occurrence or non-occurrence. The
post hoc probability of an event is the probability of the event after the event is realized, i.e.,
after the event's occurrence or non-occurrence. Clearly the post hoc probability of an event
is 1 if the event occurred and 0 if the event did not occur. The distinction between the a
priori probability and post hoc probability is worthy of note.

The a priori probability of an event occurring in the subsequent year, or other time interval,
can be important information. Knowledge of the probability of an event can modify behavior
or, put another way, the actions one takes (behavior) can depend on the a priori probability of
an event. This principle is made self-evident by considering two extreme cases. One would
almost surely exhibit different behaviors (take different actions) under the two scenarios: one
is informed that one's probability of death in the coming year is (a) 0.9999, or (b) 0.0001.

The a priori probability of an event depends upon the information available at the time the
probability is evaluated. To illustrate the point, consider the following hypothetical "game."
A living person will be selected at random from all U.S. residents and followed for a period
of one year. At the end of the year the person's vital status (alive or dead) will be ascertained.
The "event" is "the person died during the year." At the end of the year the event either
occurred (person died) or did not occur (person survived) with post hoc probabilities of 1 and
0, respectively. Before the person is selected, the U.S. mortality statistics can be used to
estimate the a priori probability that the person will die in the year. This probability is
computed as p = d/N, where N is the total number of persons in the at risk group (here, all the
persons in the U.S. population who were alive at the beginning of the year) and d is the total
number of deaths among the at risk group. For example, the data from calendar year 1993 are
(approximately), d = 2,268,000, N = 257,932,000, and the a priori probability of the event is
approximately p = 0.0088. [Data from Microsoft Bookshelf 1995 Almanac, article entitled,
"Vital Statistics, Annual Report for the Year 1993 (Provisional Statistics), Deaths," and Vital
Statistics of the United States, published by the National Center for Health Statistics.] In this
game, the a priori probability of the event is based upon very little information, simply that
the person would be a member of the at risk group, consisting of all persons who would be
alive and a U.S. resident at the time of selection.

Additional information about the at risk group, from which the subject is selected at random, 
implies additional information about the subject and modification of the a priori probability 
of the event. For example, continuing the "game" above, based on 1993 data: 

• If the at risk group were the group of U.S. males, i.e., if the subject is known, prior to
selection, to be a male, the a priori probability of the event is approximately p =
0.0093, which is about 6% higher than the case where gender is unknown or
unspecified.

• If the at risk group were the group of U.S. males aged 75-84, i.e., if the subject is
known, prior to selection, to be a male in the age interval 75-84, the a priori
probability of the event is approximately p = 0.0772, or about 8.3 times as high as for
the case where age is unknown or unspecified.
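
By way of illustration only, the arithmetic of the "game" can be sketched in Python as follows. The values of d and N and the two group-specific probabilities are the approximate 1993 figures quoted above; the function and variable names are hypothetical and are not part of the disclosure.

```python
# Approximate 1993 figures quoted in the text.
deaths = 2_268_000        # d: deaths among the at risk group during the year
at_risk = 257_932_000     # N: persons alive at the beginning of the year


def a_priori_probability(d: int, n: int) -> float:
    """Return p = d / N, the a priori probability of the event."""
    if n <= 0:
        raise ValueError("the at risk group must be non-empty")
    return d / n


p_all = a_priori_probability(deaths, at_risk)

# Group-specific probabilities as stated in the text (not recomputed here).
p_male = 0.0093
p_male_75_84 = 0.0772

print(f"p, all U.S. residents: {p_all:.4f}")
print(f"males relative to all residents: {p_male / p_all:.2f} times")
print(f"males aged 75-84 relative to all males: {p_male_75_84 / p_male:.1f} times")
```

The two ratios show how conditioning on additional information about the at risk group changes the a priori probability.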

These examples illustrate the general principle that the a priori probability of an event
depends upon the information available at the time the probability is evaluated. The most
accurate estimate of an a priori probability is typically the one based on all of the available
information.

A very accurate estimate of an a priori probability does not guarantee a specific outcome; that
is, the a priori probability for a specific individual may not be very close to the post hoc
probability. Consider the extreme case cited above, where the a priori probability of death of
a specific individual in the succeeding year is 0.0001. Although survival is highly probable,
it is not guaranteed: of all individuals in this "game," approximately 9,999 of each 10,000
will survive the year and have a post hoc probability of 0 (which is close to the a priori
probability, 0.0001), and 1 of each 10,000 will die and have a post hoc
probability of 1, which is very different from the a priori probability. To further elucidate
this principle consider a fair coin toss in which the a priori probability of "heads" is exactly
0.5. The post hoc probability of "heads" is either 0 or 1, neither of which is very close to 0.5.
Thus, the a priori probability for one individual should not be considered an approximation
of the post hoc probability for that individual. However, if a very large number of individuals
"play the game," the mean of the post hoc probabilities, which is also the proportion of
individuals for whom the event occurs, will be very close to the a priori probability.
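
The relationship between a priori and post hoc probabilities can be illustrated with a small simulation. The following Python sketch is an assumption-laden illustration, not part of the disclosed method: the function name, the seed, and the numbers of players are hypothetical. Each simulated player has a post hoc probability of exactly 0 or 1, yet the mean over many players approaches the a priori probability.

```python
import random


def mean_post_hoc(p: float, n_players: int, seed: int = 0) -> float:
    """Simulate n_players independent events with a priori probability p.

    Each player's post hoc probability is 1 (event occurred) or 0 (it did
    not); the return value is the mean of those post hoc probabilities.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    outcomes = [1 if rng.random() < p else 0 for _ in range(n_players)]
    return sum(outcomes) / n_players


# Fair coin toss: a priori probability of "heads" is 0.5.
for n in (10, 1_000, 200_000):
    print(n, mean_post_hoc(0.5, n))
```

With a single player the result is necessarily 0 or 1; only as the number of players grows does the mean settle near 0.5.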



In some cases a person can change an a priori probability by "moving" to a group with a
different a priori probability. For example, epidemiologists have shown that a U.S. resident,
middle-aged male with a high total cholesterol level, including a high low-density lipoprotein
level, has a higher a priori probability of death from myocardial infarction in the succeeding
five years than a comparable person with a much lower cholesterol level. Clinical trial
research has shown that if the high-cholesterol person can reduce his cholesterol level
substantially, i.e., "move" to a much lower cholesterol "group," he substantially reduces his a
priori probability of death from myocardial infarction in the succeeding five years.

In succeeding paragraphs and sections the word risk will be used in place of the phrase "a
priori probability of a specified event within a specified timeframe." This corresponds to the
Example.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS 

The present invention will now be described in detail for specific preferred embodiments of
the invention, it being understood that these embodiments are intended as illustrative
examples and the invention is not to be limited thereto.

The present invention is based on the theory that an individual's health is, in general,
influenced by a complex interaction of a wide range of physiological and biochemical
parameters relating to the nutritional, toxicological, genetic, hormonal, viral, infective,
anthropometric, lifestyle and any other states potentially describing the aberrant physiological
and putative pathological states of that individual. Based on this theory, the present invention
is directed towards providing a practical system for predicting future health using multivariate
statistical analysis techniques that are capable of providing quantitative predictions of one's
future health based on statistically comparing an individual's set of biomarker values with a
longitudinally-obtained database of sets of a large number of individual biomarker values for
a large test population. The term "biomarker" is used herein to refer to any biological
indicator that may affect or be related to diagnosing or predicting an individual's health. The
term "longitudinal" is used herein to refer to the fact that the biomarker values are to be
periodically obtained over a period of time, in particular, on at least two measurement
occasions.

The frequency and duration of longitudinal assessments may vary. For example, some
biomarkers may be assessed annually, for periods ranging from as short as 2 years to a period
as long as a total lifetime. Under some circumstances, such as evaluation of newborn
children, biomarkers could be assessed more frequently as, for example, daily, weekly, or
monthly. Longitudinal assessment occasions may be "irregularly timed," i.e., occur at
unequal time intervals. The set of longitudinal assessments for an individual may be
"complete," meaning that data from all scheduled assessments and all scheduled biomarkers
are actually obtained and available, or "incomplete," meaning that the data are not complete
in some manner. An individual's biomarkers may be assessed either cross-sectionally, i.e., at
one point in time, or longitudinally. The present invention is capable of performing the
required statistical analyses of data from individuals that have any or all of the characteristics
noted above, i.e., cross-sectional or longitudinal, regularly or irregularly timed, complete or
incomplete.

The subject system for assessing future health provides a quantitative estimate of the
probability of an individual acquiring a specified biological condition within a specified
period of time. The quantitative probability estimate is calculated using the sequence of
statistical analyses of the present invention. The subject system may typically be used to
provide quantitative predictions of future biological conditions for one, two, three, five, or,
ultimately, even 15-20 or more years into the future. Although the subject system may
typically be used long before symptoms of a particular disease are usually observed or
detected, the subject system may also be used for predicting future health over relatively short
time periods of only a few months or weeks, or even shorter time periods, as well.

While there is no upper limit to the number of members that may be included in the test
population, which might eventually include several million test members, a representative test
population may include far smaller numbers initially. The test population may be selected
from a much larger general population using appropriate statistical sampling techniques for
improving the reliability of the data collected.

In a representative embodiment, the present invention is directed to a computer-based system
that uses a series of statistical analysis steps for creating mathematical-statistical functions
that can be used to estimate an individual's risk of acquiring a specified biological condition
within a specified time period or age interval and to identify individuals that are at highest
risk. Prior to Phase I of the subject method, the available subjects may be randomly assigned
to a Training Sample or an Evaluation Sample; Phases I-III operate on data from the Training
Sample and Phase IV operates on data from the Evaluation Sample. Phase I is a Screening
Phase that uses correlation, logistic regression, mixed model, and other analyses to select a
large subset of biomarkers that have potentially useful information for risk estimation.
Phase II is a Parameter Estimation Phase that uses mixed linear models to estimate expected
value vector and structured covariance matrix parameters of the Candidate Biomarkers, even
in the presence of incomplete data and/or irregularly timed longitudinal data. Phase III is a
Biomarker Selection and Risk Assessment Phase that uses discriminant analysis methodology
and logistic regression to select informative biomarkers (including, where relevant,
longitudinal assessments), to estimate discriminant function coefficients, and to use an
inverse cumulative distribution function and logistic regression to estimate each individual's
risk. Phase IV is an Evaluation Phase that uses the Evaluation Sample to produce unbiased
estimates of the misclassification rates of the discriminant procedure.
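
The random assignment that precedes Phase I, and the misclassification rate estimated in Phase IV, can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the text does not specify the proportion of subjects assigned to each sample, so the 50/50 split, the seed, and all function names are hypothetical.

```python
import random


def split_subjects(subject_ids, training_fraction=0.5, seed=0):
    """Randomly assign subjects to a Training Sample or an Evaluation Sample.

    training_fraction is an assumed value; the text does not prescribe one.
    """
    rng = random.Random(seed)
    shuffled = list(subject_ids)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * training_fraction)
    return shuffled[:cut], shuffled[cut:]


def misclassification_rate(true_labels, predicted_labels):
    """Proportion of subjects whose predicted group differs from the true one."""
    if len(true_labels) != len(predicted_labels) or not true_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    errors = sum(t != p for t, p in zip(true_labels, predicted_labels))
    return errors / len(true_labels)


training, evaluation = split_subjects(range(100))
print(len(training), len(evaluation))
print(misclassification_rate([1, 0, 1, 0], [1, 1, 1, 0]))
```

Because the Evaluation Sample plays no part in Phases I-III, a misclassification rate computed on it is not biased by the fitting process.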

Although the individual steps of the statistical procedures noted in the previous paragraph are
described in the statistical literature, it is believed that these individual steps have never been
combined in a single overall procedure as disclosed herein. In particular, classical versions of
the following procedures are described, for example, in the Encyclopedia of Statistical
Sciences, edited by Samuel Kotz, Norman L. Johnson, and Campbell B. Read, published by
John Wiley & Sons, 1985 and in additional literature cited therein: (a) correlation analysis
(Volume 2, pp. 193-204), (b) logistic regression analysis (Volume 5, pp. 128-133), (c) mixed
model analysis (Volume 3, pp. 137-141, article "Fixed-, Random-, and Mixed-Effect
Models"), (d) discriminant analysis (Volume 2, pp. 389-397). The present invention can
utilize classical versions of these procedures or such enhancements to and newer versions of
these procedures as may be developed and published from time to time.

Correlation analysis is a term for statistical methods used for estimating the strength of the
linear relationship between two or more variables. Correlation, as used here, can include a
variety of types of correlation, including but not limited to: Pearson product-moment
correlations, Spearman's ρ, Kendall's τ, the Fisher-Yates r_F, and others.
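
Two of the correlation measures named above can be sketched in plain Python as follows. The data are hypothetical and the helper names are illustrative assumptions; Spearman's coefficient is computed here as the Pearson correlation of average ranks, which is one standard formulation. Neither function guards against a constant input, for which the correlation is undefined.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical biomarker pair with a monotone but nonlinear relationship.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
print(pearson(x, y), spearman(x, y))
```

For a monotone but nonlinear relationship such as this one, the rank-based measure reaches 1 while the product-moment measure stays below it.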

Logistic regression is a term for statistical methods, including log-linear models, used for the
analysis of a relationship between an observed dependent variable (that may be a proportion,
or a rate) and a set of explanatory variables. The applications of the logistic regression (or
other log-linear models) used herein are primarily for the analysis in which the dependent
variable is a binary outcome representing an individual's membership in one of two
complementary (non-overlapping) groups of subjects: a group that will acquire a specified
disease or condition (sometimes referred to herein as a "specified biological condition")
within a specified period of time or age interval, and a group that will not acquire the
specified disease or condition within a specified period of time or age interval. In this context
the explanatory variables are typically biomarkers or functions of biomarkers.
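
A logistic regression with a binary outcome and a single explanatory biomarker can be sketched in Python as follows. This is an illustration only: the data are hypothetical, and fitting by plain gradient ascent on the log-likelihood is a choice made here for brevity, not a procedure stated in the text. The learning rate and iteration count are likewise assumptions.

```python
from math import exp


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    ez = exp(z)
    return ez / (1.0 + ez)


def fit_logistic(x, y, learning_rate=0.1, n_iter=5000):
    """Fit P(y = 1 | x) = sigmoid(b0 + b1 * x) by gradient ascent."""
    b0 = b1 = 0.0
    n = len(x)
    for _ in range(n_iter):
        g0 = g1 = 0.0
        for xi, yi in zip(x, y):
            resid = yi - sigmoid(b0 + b1 * xi)  # observed minus fitted
            g0 += resid
            g1 += resid * xi
        b0 += learning_rate * g0 / n
        b1 += learning_rate * g1 / n
    return b0, b1


# Hypothetical standardized biomarker values; y = 1 means the subject
# acquired the specified condition within the specified period.
x = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
y = [0, 0, 0, 1, 0, 1, 0, 1, 1]

b0, b1 = fit_logistic(x, y)
print(f"estimated risk at x = -2: {sigmoid(b0 + b1 * -2.0):.3f}")
print(f"estimated risk at x = +2: {sigmoid(b0 + b1 * 2.0):.3f}")
```

The fitted function returns an estimated probability for each value of the biomarker, which is the quantity of interest when the dependent variable is group membership.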

Mixed model analysis is a term for statistical methods used for the analysis of expected-value
relationships between correlated dependent variables (multivariate measurements or
observations, longitudinal measurements/observations of one variable, and/or longitudinal
multivariate measurements/observations) and "independent variables" that can include
covariates, such as age, classification variables (representing group membership) and also
used for analysis of structures and parameters representing covariances among correlated
measurements/observations. The term "mixed models" includes fixed-effects models,
random-effects models, and mixed-effects models. Mixed models may have linear or
nonlinear structures in the expected-value model and/or in the covariance model. A mixed
model analysis typically includes estimation of expected value parameters (often denoted β)
and covariance matrix parameters (often of the form Σ = ZΔZ' + V, where Δ and V are
matrices of unknown parameters). A mixed model analysis may also include predictors of
random subject effects (often denoted d_k for the k-th subject) and so-called "best linear
unbiased predictors" (or "BLUPs") for individual subjects. A mixed model analysis typically
includes procedures for testing hypotheses about expected value parameters and/or covariance
parameters and for constructing confidence regions for parameters.
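
The structured covariance form, a random-effects term Z(.)Z' plus a within-subject term V, can be made concrete with a small Python sketch. The example assumes the simplest case, a random-intercept model for one subject measured on three occasions; the numerical variances and the helper names are hypothetical, and the variable Delta stands for the random-effects covariance matrix in the formula above.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(m):
    return [list(row) for row in zip(*m)]


def structured_covariance(z, delta, v):
    """Return Z * Delta * Z' + V for one subject."""
    zdz = matmul(matmul(z, delta), transpose(z))
    n = len(v)
    return [[zdz[i][j] + v[i][j] for j in range(n)] for i in range(n)]


# Random-intercept model, three measurement occasions (assumed values):
Z = [[1.0], [1.0], [1.0]]   # design matrix for the random intercept
Delta = [[4.0]]             # between-subject variance of the intercept
V = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]  # error variance

Sigma = structured_covariance(Z, Delta, V)
for row in Sigma:
    print(row)
```

In this case every pair of occasions shares the between-subject variance as its covariance, which is how the model represents correlation among repeated measurements on the same subject.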

In particular, discriminant analysis methodology relates to statistical analysis methods and
techniques for developing discriminant functions that may be used for assigning a
multivariate observation (e.g., a vector of biomarker values from one subject) to one of two
complementary (non-overlapping) groups of subjects (e.g., a group that will acquire a
specified disease or condition within a specified period of time or age interval, and the group
that will not acquire the specified disease or condition within a specified period of time or age
interval), on the basis of its value. A discriminant function, furthermore, may refer to a
function that is used as the basis for calculating an estimate of the probability that a given
observation belongs in a given group. For the present invention, the observations of interest
typically comprise a plurality of biomarker values that are obtained from each member of a
large test population or from an individual test subject. The discriminant functions of the
present invention are developed using distributions of these biomarker values for each
biomarker determined to be of interest. Such distributions plot the total number of individual
members of the test population having each biomarker value vs. the biomarker value itself.
Thus, the present invention employs a statistical procedure that uses distributions based on
the individual biomarker values that are obtained for each biomarker from individual
members from the test population, as distinct, for example, from using mean biomarker
values that are obtained from different test populations for the different biomarkers.

The term "discriminant function" is intended to mean any one of several different types of
functions or procedures for classifying an observation (scalar or vector) into two or more
groups, including, but not limited to, linear discriminant functions, quadratic discriminant
functions, nonlinear discriminant functions, and various types of so-called optimal
discriminant procedures.
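
One of the types listed above, a linear discriminant function for two groups, can be sketched in Python as follows. The sketch assumes two biomarkers per subject, a pooled within-group covariance matrix, and a cut-off midway between the group means (equal prior probabilities); the data values and all names are hypothetical and are not taken from the disclosure.

```python
def mean_vector(rows):
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]


def pooled_covariance(group_a, group_b):
    """Pooled within-group covariance matrix of two groups."""
    p = len(group_a[0])
    s = [[0.0] * p for _ in range(p)]
    for rows in (group_a, group_b):
        m = mean_vector(rows)
        for r in rows:
            for i in range(p):
                for j in range(p):
                    s[i][j] += (r[i] - m[i]) * (r[j] - m[j])
    dof = len(group_a) + len(group_b) - 2
    return [[s[i][j] / dof for j in range(p)] for i in range(p)]


def inverse_2x2(m):
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def fit_linear_discriminant(group_d, group_not_d):
    """Return (weights, threshold) of a two-biomarker linear discriminant."""
    m1, m0 = mean_vector(group_d), mean_vector(group_not_d)
    s_inv = inverse_2x2(pooled_covariance(group_d, group_not_d))
    diff = [m1[i] - m0[i] for i in range(2)]
    w = [sum(s_inv[i][j] * diff[j] for j in range(2)) for i in range(2)]
    midpoint = [(m1[i] + m0[i]) / 2 for i in range(2)]
    threshold = sum(w[i] * midpoint[i] for i in range(2))
    return w, threshold


def classify(x, w, threshold):
    score = sum(wi * xi for wi, xi in zip(w, x))
    return "D" if score > threshold else "not D"


# Hypothetical biomarker vectors for subjects who did / did not acquire
# the specified condition.
group_d = [[6.2, 3.1], [5.8, 2.9], [6.5, 3.6], [6.0, 3.3]]
group_not_d = [[4.9, 2.0], [5.1, 2.4], [4.6, 1.9], [5.0, 2.2]]

w, threshold = fit_linear_discriminant(group_d, group_not_d)
print(classify([6.1, 3.2], w, threshold))
print(classify([4.8, 2.1], w, threshold))
```

The discriminant score is a single weighted combination of the biomarker values, so an observation is assigned to a group by comparing one number with the threshold.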

The computer-based system of the present invention includes a computer comprised of a
processor that is capable of running a computer program or set of computer programs
(hereinafter referred to simply as "the computer program") comprising the steps for
performing the required computations and data processing in the various steps and phases of
the present invention. The processor may be a microprocessor, a personal computer, a
mainframe computer, or in general, any digital computer that is capable of running computer
programs that can perform the required computations and data processing. The processor
typically includes a central processing unit, a random access memory (RAM), read-only
memory (ROM), one or more buses or channels for transfer of data among its various
components, one or more display devices (such as a "monitor"), one or more input-output
devices (such as floppy disk drives, fixed disk drives, printers, etc.), and adapters for
controlling input-output devices and/or display devices and/or connecting such devices to the
buses/channels. A particular processor may include all of these components or only a subset
of these components.

The computer program may be stored in ROM or on a disk or set of disks, or in any other 
tangible medium that may be used for storing and distributing computer programs. 

The computer program is capable of performing the computations for the various phases and 
steps of the analysis on cross-sectional and/or longitudinal multivariate biomarker data. 

The biomarker data are preferably collected from a test population that is sufficiently large so 
that the total number of members acquiring a specified biological condition of interest within 
a two to three year period is large enough for discriminant analysis methodology to be 
meaningfully employed for that specified biological condition. Since one of the features of 
the present invention is directed toward providing a means for using the same database to 
make predictions relating to acquiring any of the major diseases and/or dying from any of the 
major underlying causes of death within as few as one to two years, the test population is 
preferably large enough to be useful for applying the subject system to any one of the more 
common diseases and underlying causes of death that account in the aggregate for at least 
about 60%, and more preferably, at least about 75%, of all deaths of interest, wherein the 
deaths of interest are herein defined as those of a pathological nature, as distinct from those 
caused by accident, homicide or suicide. 

For example, using data from the Centers for Disease Control and Prevention (Monthly Vital
Statistics Report, Supplement, Vol. 44, No. 7, Feb 26, 1996), it can be shown that more than
75% of all pathologically derived deaths can be accounted for by the following underlying
causes of death: malignant neoplasms (ICD 140-208) having a crude death rate, that is, as
distinguished from an age-adjusted death rate, of 205.6/100,000; major cardiovascular
diseases (ICD 390-448), 367.8/100,000; chronic obstructive pulmonary diseases (ICD 490-496),
39.2/100,000; and diabetes mellitus (ICD 250), 20.9/100,000; as compared with a total
crude death rate of about 880/100,000 for pathologically derived deaths. These diseases are
the ones which, in fact, have been shown to exhibit major dietary and lifestyle effects, to be
responsive to altered dietary and lifestyle conditions, and to be indicated by a variety of
definable and measurable biomarkers.

As one of the unique features of the present invention, the subject computer-based system and 
apparatus may be used to determine the risk of a specified individual acquiring any one of 
these major diseases based on comparing that individual's profile of biomarker values with 
the biomarker values obtained from members of a large test population. Since it is known 
that these major diseases share many common factors that may be reflected in the biomarker 
values, the subject computer-based system may be used to concurrently assess the risk of 
acquiring any of these major diseases. For example, it is known that total serum cholesterol 
is a biomarker that is related to many of these diseases. By monitoring each profile of 
biomarker values that is a significant predictor, in combination with other significant 
biomarker predictors, of a specific disease or underlying cause of death and using the present 
invention to compare that profile with the test populations, an individual subject may be 
informed, with specified quantitative reliability, which disease poses the greatest risk for that 
specific individual. 

A particular feature of the present invention is that those individuals who are at greatest risk 
of acquiring a specified disease may be provided with a quantitative probability of acquiring 
that disease within a specified time period or age interval in the future well before any of the 
typical symptoms of that disease are manifest. Armed with that information, for the many
diseases known to be responsive to altered dietary and lifestyle conditions, that individual 
may then make those behavioral changes that can reduce the risk of the disease identified. 

Furthermore, as more and more data are acquired for larger and larger numbers of subjects 
over longer and longer periods of time, more and more refined divisions of each of the major 
diseases and causes of deaths as well as of the less common diseases and underlying causes of 
death can be defined and included in the methodology of the present invention. For example, 
a breakdown can be made in terms of the different types of cancer, e.g., liver cancer, lung 
cancer, stomach cancer, prostate cancer, etc. The present computer-based system, thus, 
provides a means for including ever larger fractions of the population, so as to predict the 
quantitative risk of each individual acquiring, or not acquiring, a specified pathologically 
derived disease within a specified time, wherein the diseases are defined with continuously
narrower specificity. 

The comprehensive set of biomarkers for which biomarker data are collected from the test
population preferably includes as many as possible of the diverse biomarkers known or
believed to be related to the most common diseases and underlying causes of pathologically
derived deaths. In addition, representative clusters of biomarker values from each of the
known and generally accepted genetic, physiological and biochemical domains of biological
function may be included. Additional biomarkers that are preferably included are, for
example, all those that can be measured in biological samples that may be stored for analysis
long after the sample is collected.

The biological samples preferably include a blood and a urine sample, but still other
biological samples may be included in the samples that are collected. For example, samples
of saliva, hair, toenails and fingernails, feces, expired air, etc. may also be collected. Such
biological samples are typically obtained from substantially every member of the test
population. However, in some situations, specific subsets of biomarkers may be obtained
only from specific subsets of the population.

Concurrent with collecting the biological samples, biomarker data relating to nutritional
habits and lifestyles are also typically obtained from each member of the test population.
Biomarkers relating to nutritional habits and life styles may include, for example, those
shown in Table 1. While the nutritional and life style biomarkers listed in Table 1 are
intended to be illustrative of the types of biomarkers relating to nutritional habits and life
styles, it is to be understood that this list is not exhaustive of the nutritional and life style
biomarkers that fall within the scope of the present invention. The biomarkers that exhibit
significant nutritional determinism, as well as the clinical and infections biomarkers, may
also be determined by other factors, such as by nutritional intake. The delineation of
categories, (e.g. serum biomarkers, urine biomarkers, questionnaire, etc.), shown in Table 9
is, thus, only an illustrative division of the categories that may be selected to obtain the
biomarker values. The nutritional and life style biomarkers that may change over time are
preferably collected and recorded for each member of the test population each time a
biological sample is taken.

TABLE 1. An illustrative list of biomarkers that may be used in the subject method for
predicting future health.



SERUM BIOMARKERS
Total cholesterol*
HDL cholesterol*
LDL cholesterol*
Apolipoprotein B*
Apolipoprotein A1*
Triglycerides*
Lipid peroxide (malondialdehyde equivalency; TBA)*
α-Carotene (corrected for lipoprotein carrier)*
β-Carotene (corrected for lipoprotein carrier)*
γ-Carotene (corrected for lipoprotein carrier)*
zeta-Carotene (corrected for lipoprotein carrier)*
α-Cryptoxanthin (corrected for lipoprotein carrier)*
β-Cryptoxanthin (corrected for lipoprotein carrier)*
Canthaxanthin (corrected for lipoprotein carrier)*
Lycopene (corrected for lipoprotein carrier)*
Lutein (corrected for lipoprotein carrier)*
anhydro-Lutein (corrected for lipoprotein carrier)*
Neurosporene (corrected for lipoprotein carrier)*
Phytofluene (corrected for lipoprotein carrier)*
Phytoene (corrected for lipoprotein carrier)*
α-Tocopherol (corrected for lipoprotein carrier)*
γ-Tocopherol (corrected for lipoprotein carrier)*
Retinol*
Retinol binding protein*
Ascorbic acid*
Fe*
K*
Mg*
Total phosphorus*
Inorganic phosphorus*
Se*
Zn*
Ferritin*
Total iron binding capacity*
Fasting glucose*
Urea nitrogen*
Uric acid*
Prealbumin*
Albumin*
Total protein*
Bilirubin*
Thyroid stimulating hormone T3*
Thyroid stimulating hormone T4*
Cotinine
Aflatoxin-albumin adducts
Hepatitis B anti-core antibody (HBcAb)
Hepatitis B surface antigen (HBsAg+)
Candida albicans antibodies
Epstein-Barr virus antibodies
Type 2 Herpes Simplex antibodies
Human Papilloma virus antibodies
Helicobacter pylori antibodies
Estradiol (E2) (adjusted for female cycle)*
Sex hormone binding globulin*
Prolactin (adjusted for female cycle)*
Testosterone (adjusted for female cycle for women)*
Hemoglobin*
Myristic acid (14:0)*
Palmitic acid (16:0)*
Stearic acid (18:0)*
Arachidic acid (20:0)*
Behenic acid (22:0)*
Tetracosanoic acid (24:0)*
Myristoleic acid (14:1)*
Palmitoleic acid (16:1)*
Oleic acid (18:1n9)*
Gadoleic acid (20:1)*
Erucic acid (22:1n9)*
Tetracosenoic acid (24:1)*
Linoleic acid (18:2n6)*
Linolenic acid (18:3n3)*
γ-Linolenic acid (18:3n6)*
Eicosadienoic acid (20:2n6)*
Di-homo-γ-linolenic acid (20:3n6)*
Arachidonic acid (20:4n6)*
Eicosapentaenoic acid (20:5n3)*
Docosatetraenoic acid (22:4n6)*
Docosapentaenoic acid (22:5n3)*
Docosahexaenoic acid (22:6n3)*
Total saturated fatty acids (16:0, 18:0, 20:0, 22:0, 24:0)*
Total monounsaturated fatty acids (14:1, 16:1, 18:1n9, 20:1, 24:1)*
Total n3 polyunsaturated fatty acids (18:3n3, 20:5n3, 22:5n3, 22:6n3)*
Total n6 polyunsaturated fatty acids (18:3n6, 20:2n6, 20:3n6, 20:4n6, 22:4n6)*
Total n3 polyunsaturated/total n6 polyunsaturated fatty acids (18:3n3, 20:5n3, 22:5n3, 22:6n3/18:3n6, 20:2n6, 20:3n6, 20:4n6, 22:4n6)*
Total polyunsaturated fatty acids (18:2n6, 18:3n3, 18:3n6, 20:2n6, 20:3n6, 20:4n6, 20:5n3, 22:4n6, 22:5n3, 22:6n3)*
Total polyunsaturated/saturated fatty acids (18:2n6, 18:3n3, 18:3n6, 20:2n6, 20:3n6, 20:4n6, 20:5n3, 22:4n6, 22:5n3, 22:6n3/16:0, 18:0, 20:0, 22:0, 24:0)*
[About 10-30 genetic markers, depending on diseases being investigated]

URINE BIOMARKERS
Orotidine
Cl*
Mg*
Na*
Creatinine
Volume
NO3
Aflatoxin (AF) M1
AF N7-guanine
AF P1
AF Q1
Aflatoxicol
8-deoxy guanosine

FOOD DERIVED NUTRIENT INTAKES (FROM QUESTIONNAIRE)
Total protein*
Animal protein*
Plant protein*
Fish protein*
Lipid*
'Soluble' carbohydrate*
Total dietary fiber*
Total calories*
Percentage of caloric intake from lipids*
Cholesterol*
Ca*
P*
Fe*
K*
Mg*
Mn*
Na*
Se*
Zn*
Total tocopherols (corrected for lipid intake)*
Total retinoid*
Total carotenoid*
Thiamine*
Riboflavin*
Niacin*
Vitamin C*
[About 30 different types of foods]*
[About 30 different fatty acids]*

RED BLOOD CELLS
RBC glutathione reductase*
RBC catalase*
RBC superoxide dismutase*

ANTHROPOMETRIC PARAMETERS
Height*
Weight*

* Indicates biomarkers which exhibit significant nutritional determinism.

The biological samples are analyzed to determine the biomarker value for each component in
the biological sample for which a biomarker value is desired. It is to be understood that any 
component that may be found and measured in a biological sample falls within the scope of 
the present invention. For example, genetic biomarkers which may be measured in a blood 
sample, as well as the biomarkers that can be measured in any other appropriate biological 
sample, may also be included. 

Since another feature of the present invention is that of identifying new sets of biomarkers 
useful for predicting disease and death, the biomarker sets may include biomarkers not 
previously known to have statistical significance for predicting a specific disease or specific 
cause of death. Thus, since the total number of biomarkers that may be used is substantially 
unlimited in principle, the actual number of biomarkers used may, in general, be limited only 
by practical economic and methodological considerations. 

Since still another feature of the present invention is that of providing a computer-based 
system for predicting specified biological conditions within a specific time period or age 
interval in the future, the total number of biomarker values may be limited to only those 
biomarker values which have statistical significance for predicting a single specified 
biological condition. Thus, while it is intended that the subject system is typically used as a 
general purpose tool for predicting and monitoring most, and, eventually, substantially all 
major types of diseases and underlying causes of death, use of the methodology disclosed 
herein may also be directed to one disease or cause of death at a time. 

After being collected, the biological samples may be analyzed immediately or the samples 
may be stored for later analysis. Since it is expected that a large number of samples may be 
collected in a relatively short period of time and under circumstances not conducive to 
immediate on-site analysis, the samples are preferably stored for later analysis. Because the 
samples may typically be stored for a substantial period of time, the samples are typically 
frozen. The samples are to be stored and transported using conditions that preserve the 
integrity of the samples. Such techniques are described, for example, in Chen, J., Campbell,
T. C., Li, J., and Peto, R., Diet, Life-style and Mortality in China: A Study of the
Characteristics of 65 Chinese Counties. Oxford, U.K.; Ithaca, NY; Beijing, PRC: Oxford
University Press; Cornell University Press; People's Medical Publishing House, 1990.

Use of physical specimens such as biological samples is particularly preferred since such
samples provide a practical means of providing a rich source of longitudinally-obtained
biomarker data that can be collected, stored and analyzed using established, cost-effective
techniques. The biological samples are preferably collected for the test population over an
extended period of time of at least 5-10 years, and most preferably, for 15-20 years or more,
such that the quality of the data generated will continuously provide more and more reliable
probability predictions.

Since the reliability of the subject system is ultimately determined by the quality of the 
biomarker data collected, appropriate measures are to be taken to assure integrity of the data 
from all aspects. For example, concerning biomarker stability, it is necessary to consider and 
take appropriate measures to account for the many factors which may influence or cause
deterioration of the biomarker values over time. 

Furthermore, while the subject disclosure is typically directed toward obtaining biomarker
data from physical specimens that are obtained from members of a test population or a test
subject, as well as the biomarker data derived from dietary and lifestyle surveys of each test
individual, use of biomarker data obtained from any source falls fully within the spirit and
scope of the present invention. For example, the subject methodology may further comprise
use of medical diagnostic data obtained from electrophysiological measurement techniques
such as electroencephalographic (EEG) data, electrocardiographic (ECG) data, radiologic
(X-ray) data, magnetic resonance imaging (MRI), etc., either alone or, most preferably, in
combination with the longitudinally-obtained biomarker data from biological samples and
dietary and lifestyle surveys.

Since the test population is preferably monitored over a period of years, it is to be expected
that a mortality rate will be observed for the test population that is representative of the
overall general population. For each mortality in the test population, the individual is
identified and the underlying cause of death is recorded, preferably using a known coding
system, for example, the established International Statistical Classification of Diseases and
Related Health Problems (ICD-10), Geneva, World Health Organization, 1992-c1994, 10th
revision. Other coding systems may also be used while remaining within the scope and spirit
of the present invention.

Using an effective system to identify when a member of the test population acquires a disease 
or specified biological condition, morbidity data is also collected, in addition to collecting the 
biomarker and mortality data of the test population. 

The database of biomarker values preferably includes information from each individual 
recording the dates and ages at the times the biomarkers and biomarker samples are collected
and recorded, as well as accurate information from the surveillance of the individual recording each
incident of disease, medical condition, medical pathology, or death, including diagnosis and 
date of incident. The database includes values of biomarkers assessed before, during, and 
after each incident, where feasible. 
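
The organization of one individual's entries in such a database may be sketched as follows. This is a minimal illustration only; the class and field names (BiomarkerRecord, Incident, SubjectHistory, records_before) are hypothetical and are not part of the disclosure.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List, Optional


@dataclass
class BiomarkerRecord:
    """One collection occasion: when it happened and what was measured."""
    collected_on: date
    age_years: float
    values: Dict[str, Optional[float]]  # None marks a value that was not obtained


@dataclass
class Incident:
    """One surveillance event: disease, medical condition, pathology, or death."""
    occurred_on: date
    diagnosis: str  # e.g., a code from whatever coding system is in use


@dataclass
class SubjectHistory:
    """All longitudinal data held for one member of the test population."""
    subject_id: str
    records: List[BiomarkerRecord] = field(default_factory=list)
    incidents: List[Incident] = field(default_factory=list)

    def records_before(self, incident: Incident) -> List[BiomarkerRecord]:
        """Biomarker records collected strictly before the given incident."""
        return [r for r in self.records if r.collected_on < incident.occurred_on]
```

Keeping the collection date and age on every record is what allows values assessed before, during, and after an incident to be separated later.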

Since one aspect of the present invention relates to identifying biomarkers not yet known to 
be statistically significant for predicting future onset of a specified disease or underlying 
cause of death, as many biomarkers as possible are monitored. In a representative 
embodiment, about 200 biomarker values are obtained from each member of the test 
population, although there is substantially no upper limit to the number of biomarkers that 
may be used to develop the computer-based statistical analysis methodology. 

Since the present invention is directed toward providing a practical and reliable system for 
predicting a specified biological condition within a specified period of time or age interval, a 
substantially complete set of biomarker values is collected from each member of the test 
population at least two different times. More preferably, so as to obtain information on trends 
or changes with time, a full set is collected at least three times and, most preferably, the 
biomarker values are collected at periodic intervals for as long as practically feasible.

In still another aspect of the subject invention, which is based on the theory that ratios of a
person's individual biomarker values, or changes in the ratios, may be more important for 
predicting future health than the actual level of any given biomarker value, the discriminant 
function is typically determined using substantially complete sets of biomarker values. Since 
it is recognized that for practical reasons totally complete sets of biomarker values cannot 
reasonably be expected to be obtained from every member of the test population on every 
testing occasion, the statistical analysis methodology of this invention includes methods that 
reliably account for incomplete data in a statistically valid manner. 
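
The notion of ratios of biomarker values, and of changes in those ratios over time, may be illustrated with a short sketch. The function names and the dictionary keys used for biomarkers are hypothetical, and the sketch assumes that both values are present at every visit; it does not address incomplete data.

```python
from typing import Dict, List


def ratio_series(visits: List[Dict[str, float]], numerator: str, denominator: str) -> List[float]:
    """Ratio of two biomarker values at each visit, in visit order."""
    return [v[numerator] / v[denominator] for v in visits]


def ratio_changes(ratios: List[float]) -> List[float]:
    """Change in the ratio between each pair of consecutive visits."""
    return [later - earlier for earlier, later in zip(ratios, ratios[1:])]
```

For a person with three visits, ratio_series yields three ratios and ratio_changes yields the two visit-to-visit differences, either of which could then be treated as a derived biomarker.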

A further object of the present invention is not only to provide a means of quantitatively 
assessing the risk of future specific diseases, but also to provide a practical tool for defining 
and identifying those biological conditions wherein one has the lowest risk of all future 
diseases. The term "specified biological condition" is, therefore, in the context of the present 
invention, meant to include all ranges of health, from the most robustly healthy to the most 
severely diseased. The present invention is, thus, directed towards providing a system for 
monitoring and predicting future health for the most healthy to the least healthy. 

Although the results obtained from the test population may be used for predicting the future 
health of general populations in particular countries, it is not necessary to select the test 
population from the same general population for which individual future health predictions 
will be made. Such a limitation is not necessary since it is known that populations of 
individuals who possess probabilities of disease which are characteristic of their home 
countries, and who then move to new countries whose populations possess probabilities of 
different sets of diseases, will acquire those diseases which are characteristic of the countries 
to which they move. This occurs during a time coincident with and following their 
acquisition of the diet and lifestyle conditions of the new country. That is, all races and 
ethnic groups of the world tend to acquire the same general diseases regardless of their 
inherited characteristics, which may be unique to each race or ethnic group. 

One of the specific features of the present invention is that a system is provided for predicting 
when onset of a future health problem will occur before the problem may typically be
diagnosed. The time of future onset of the specific health problem occurring for a specific
individual can be predicted with a specified quantitative probability estimate based on
applying the subject discriminant analysis methodology to the database collected from the
large test population. Furthermore, the present invention provides a system for predicting
specific health problems further and further into the future with greater and greater reliability 
as more and more data are collected for ever larger test populations for longer and longer 
periods of time. 

The biological samples are typically analyzed for each biomarker for which quantitative 
values are desired. For cost and convenience reasons and because of the large number of 
samples that may be collected, the samples may be analyzed initially only for those 
individuals already diagnosed with a disease or who die during the time period over which 
the samples have been collected, as well as for a randomly selected fraction of the remainder 
of the test population. For example, if the mortality rate for the test population
surveyed is typically in the range of 2-3% annually, a 300,000 member test population would
produce 6,000-9,000 deaths annually, wherein a significant number of
deaths would have been caused by each of the major underlying causes of death.
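
The arithmetic of this example may be checked with a brief sketch; the function name is hypothetical.

```python
def expected_annual_deaths(population: int, annual_rate_percent: float) -> int:
    """Expected deaths per year for a population at a given annual mortality rate (in percent)."""
    return round(population * annual_rate_percent / 100)


low = expected_annual_deaths(300_000, 2)   # lower end of the 2-3% range
high = expected_annual_deaths(300_000, 3)  # upper end of the 2-3% range
print(low, high)  # prints "6000 9000"
```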

One of the further features of the present invention comprises the step of waiting until a 
substantial number of deaths have occurred in the test population and then selecting those 
individuals as the ones for whom the biomarker values are to be determined initially. In 
addition, a group of still living test members may then be selected from the remainder of the 
test population. Because large enough numbers of samples are needed to obtain statistically
significant results while costs must also be controlled, the subject
system provides a practical method of limiting the analytical measurement costs to only those
samples that will tend to provide the most information for the least cost. Naturally, as more
and more deaths occur in the test population, larger and larger numbers of samples will be 
analyzed over time. However, the value of the data obtained, from the point of view of 
establishing more and more reliable quantitative predictions of future health, will be more or 
less commensurate with the cost of acquiring the additional biomarker values. This is another 
of the many special features of the present invention that distinguishes it from any known
prior art system. This technique of postponing sample analysis permits postponement of cost
until the results obtained tend to have greater practical value.
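
The selection of samples for initial analysis may be sketched as follows: every member who has died or been diagnosed is selected, together with a randomly chosen fraction of the remaining members. The function name, the status labels, and the fraction parameter are hypothetical; the disclosure does not prescribe a particular sampling fraction.

```python
import random
from typing import Dict, List, Optional


def select_for_analysis(status: Dict[str, str], fraction: float,
                        seed: Optional[int] = None) -> List[str]:
    """Pick the subjects whose stored samples are analyzed first.

    status maps subject id -> "deceased", "diagnosed", or "living" (hypothetical labels).
    Everyone deceased or diagnosed is selected, plus a random fraction of the rest.
    """
    rng = random.Random(seed)
    cases = sorted(s for s, st in status.items() if st in ("deceased", "diagnosed"))
    remainder = sorted(s for s, st in status.items() if st == "living")
    n_sampled = round(len(remainder) * fraction)
    return cases + rng.sample(remainder, n_sampled)
```

Samples of members not selected remain in storage and may be analyzed in a later round as further deaths occur.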

Upon selecting the samples to be analyzed, the biomarker values may be determined using 
well known methodologies. Since a large number of samples are to be analyzed with each 
being measured for a large number of biomarker values, many, if not most, of these 
measurements are typically made using a multi-channel analyzer, for example, the
BMD/Hitachi Model 747-100 such as manufactured by the Boehringer Mannheim Corp. of
Indianapolis, IN. Such analyzers can be designed to measure the biomarker values of
selected large sets of biomarkers simultaneously using relatively small quantities of the total
sample. For example, the quantity of blood collected is typically about 15 ml, whereas only
about 10-30 µl may be required per analytical measurement. Similarly, the quantity of urine
collected is typically about 50 ml, whereas a quantity of about 100 µl is required for the
analysis. Appropriately small quantities of other biological samples may also be used. 

Since, in the representative embodiments, physically-preservable biological samples may be 
used, and since only relatively small analytical sample quantities may be used for taking 
measurements at any arbitrarily selected time, typically long after the sample has been 
collected, the subject methodology may be effectively applied using any biomarker that is 
detectable within a given sample. For example, although the system may be used initially to 
analyze what are currently deemed to be the more significant biomarkers, the system may be 
readily adapted to include other biomarkers that are not yet recognized to have significance 
for predicting future health. In principle, with adequate time and economic resources, every 
biomarker that is detectable in the preserved biological samples may ultimately be measured. 

Although it may be desirable to acquire substantially complete sets of biomarker values for
each member of the test population, this is typically very difficult to realize, especially if the
samples are to be longitudinally collected from a wide, geographically dispersed population
base. Using conventional statistical analysis methodology, in which an incomplete set of data
is typically discarded and not used at all, substantial quantities of data ultimately covering a
large fraction of the initial test population would need to be discarded. This can result in a
substantial waste of resources and severe degradation of the quality of the results generated
by the remaining data. The subject computer-based methodology includes a feature that
provides a means of using substantially all data collected, by using a statistically verifiable
technique for filling in the "missing values." This is a particularly useful aspect of the subject
methodology, which is based on collecting what amounts to huge quantities of data, as
compared with any prior art studies, for very large numbers of test members from a test
population that is widely dispersed geographically. Acquisition of comprehensive data from
a diverse large test population is particularly desirable so as to obtain biomarker values from
members having widely divergent dietary and lifestyle practices representative of the entire
human experience.

For the purpose of describing the present invention, the following terminology is explained
herein:

A "specified biological condition" may, for example, refer to any one of the following:

• a specified disease, for example, as classified in International Statistical Classification
of Diseases and Related Health Problems, supra (e.g., diabetes mellitus);

• a specified medical or health condition or syndrome (e.g., hypertension, as generally
defined by deviations of biomarker or biomarker set values from the usual normal
distributions);

• a specified medical event and its sequelae (e.g., ischemic stroke and subsequent death,
or non-death and stroke-related partial paralysis and related conditions; myocardial
infarction and subsequent death, or non-death and MI-related conditions);

• premature death from any cause (premature death at an age earlier than the mean age
at death as projected from the person's gender and age at first evaluation);

• death at a specified age;

• a newly defined category based on having or acquiring a specified set of biomarker
values for a specified set of biomarkers.

Acquisition or onset of the specified biological condition refers to the situation wherein a 
person does not have the specified biological condition at the time of a given evaluation, but 
who subsequently experiences the specified biological condition, in which case the person is 
said to have acquired the specified biological condition with onset being defined as occurring 
when the person acquired that specified biological condition. 

For a specified biological condition and for a population of persons who do not have, or have
not had, the specified biological condition, there are two complementary subpopulations,
identified as Group D and Group D̄, and described as follows:

• Group D: That subpopulation of persons who will acquire the specified biological
condition within a specified timeframe. As used here, specified timeframe can refer to
a specified interval of calendar time (e.g., "the next five years"), to a specified age
interval (e.g., "between 65 and 70 years of age"), or to a similar specific time or age
interval.

• Group D̄: That subpopulation of persons who will not acquire the specified biological
condition within the specified timeframe.
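
Assignment to Group D or to its complement may be sketched as follows. This is a minimal illustration, assuming the timeframe is an age interval and that the age at onset, if any, is known; the function name and the label "D-bar" (standing for the complementary group of persons who do not acquire the condition) are hypothetical. Persons lost to follow-up before the end of the interval are not handled here.

```python
from typing import Optional


def assign_group(onset_age: Optional[float], start_age: float, end_age: float) -> str:
    """Return "D" if onset falls in [start_age, end_age), else "D-bar".

    onset_age is None when the condition was never observed in the subject.
    The subject is assumed not to have the condition at start_age.
    """
    if onset_age is not None and start_age <= onset_age < end_age:
        return "D"
    return "D-bar"
```

For the age interval "between 65 and 70 years of age," a person with onset at age 67.2 is assigned to Group D, while a person with onset at age 72, or with no onset at all, is assigned to the complementary group.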

These subpopulations of subjects are partially characterized by a specific longitudinal pattern 
of data on a (possibly large) number of biomarkers. A longitudinal pattern includes not only 
the level or tissue concentration of a biomarker, but also changes in the level. If one knows 
which longitudinal patterns of biomarkers partially characterize the subpopulations, and has 
the necessary data from a specific person, that person can be classified into one of two 
complementary groups, based upon whether the person is projected to belong to Group D or
to Group D̄:



• Group PD: That group of persons who, at the beginning of the specified timeframe,
are predicted to acquire the specified biological condition within the specified
timeframe, i.e., projected to belong to Group D. These persons are described as
having a prescribed high probability of acquiring the specified biological condition
within the specified timeframe.

• Group PD̄: That group of persons who, at the beginning of the specified timeframe,
are predicted not to acquire the specified biological condition within the specified
timeframe, i.e., projected to belong to Group D̄. These persons are described as
having a prescribed low probability of acquiring the specified biological condition
within the specified timeframe.

The term "prescribed high probability" may vary in magnitude from having a probability as
low as a few percent, perhaps even as low as 1% or less, or may be as high as 10%, 20%,
50%, or even substantially higher, depending on the specified biological condition. For
example, the increased risk of acquiring lung cancer due to smoking may be perceived by
many as a significant and preferably avoidable risk, even though the actual several-fold
increase in risk that is caused by smoking may only be in the range of a 5-10% probability for
acquiring lung cancer as far as 15-20 years or more into the future. In any case, for each
specified biological condition for which the system is applied, a quantifiably prescribed
probability may be determined. The "prescribed low probability" may be specified simply as
the probability of not being in the high risk group for acquiring the specified biological
condition or, alternatively, the term may be separately specified as a concrete value.
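
Classification against a prescribed probability may be sketched as follows, assuming that a probability estimate is already available for the person and that the prescribed low probability is defined simply as not being in the high risk group. The function name and the label "PD-bar" (standing for the group predicted not to acquire the condition) are hypothetical.

```python
def classify(probability: float, prescribed_high: float) -> str:
    """Return "PD" when the estimated probability meets the prescribed level, else "PD-bar"."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    return "PD" if probability >= prescribed_high else "PD-bar"
```

With a prescribed high probability of 5%, an estimated probability of 7% places the person in the predicted high risk group, whereas an estimate of 2% does not.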

At the point when a statistically adequate number of the members of the test population can
be identified as belonging to Group D or Group D̄, the biomarker values of the members of
Group D may be compared with those of the members of Group D̄ using the subject methodology, so as
to determine a statistical procedure for classifying members into Groups PD and PD̄ or for
estimating the probability, for each member of the test population, of acquiring the specified
biological condition within the specified time period or age interval, i.e., the probability of
belonging to Group PD or the probability of belonging to Group PD̄. In a representative
embodiment of the subject invention, the statistical procedure for classifying members into
Groups PD and PD̄ will be a form of a discriminant analysis procedure as described below;
the procedure may be referred to as a "discriminant procedure" or "discrimination
procedure." A "statistically adequate number" may be defined as one for which the total
number of biomarkers used in the analysis and the total number of test members for which the
biomarker values are available are each large enough such that convergence is achieved for
the computational procedures used in the subject methodology.

A discrimination procedure has two relevant error rates:

(1) Proportion of false positives, i.e., the proportion of future subjects who will be
classified in Group PD but who actually belong to Group D̄.

(2) Proportion of false negatives, i.e., the proportion of future subjects who will be
classified in Group PD̄ but who actually belong to Group D.

A representative embodiment of the subject invention will incorporate methodology for
obtaining accurate estimates of these two error rates.
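
The two error rates may be computed from paired predicted and actual group labels, as in the following sketch. The function name and the string labels are hypothetical, and taking both proportions over all subjects is an assumed convention; other conventions divide by the number of subjects in the relevant actual group.

```python
from typing import List, Tuple


def error_rates(predicted: List[str], actual: List[str]) -> Tuple[float, float]:
    """Return (false_positive_proportion, false_negative_proportion).

    predicted holds "PD" or "PD-bar"; actual holds "D" or "D-bar".
    Both proportions are taken over all subjects (an assumed convention).
    """
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(predicted)
    false_pos = sum(p == "PD" and a == "D-bar" for p, a in zip(predicted, actual))
    false_neg = sum(p == "PD-bar" and a == "D" for p, a in zip(predicted, actual))
    return false_pos / n, false_neg / n
```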

A representative embodiment of the subject invention consists of three phases, each with 
multiple steps. The three phases are: 

Phase I. Establish Evaluation Methodology and Select Biomarkers for Consideration.
Phase II. Reduce the Candidate Biomarkers to a Set of Select Biomarkers that have
Discriminatory Power and Perform Mixed Model Estimation of the
Covariance Structure and Predicted Values.
Phase III. Calculate the Discriminant Functions Using Estimated Means and
Predicted Values and Compute Logistic Predicted Values for each Subject;
Estimate Error Rates for the Discriminant Functions.

Each Phase has multiple steps. Within a phase some groups of steps are iterative; that is, a
specific set of steps may be repeated a number of times until a specified objective is achieved.
A representative embodiment of the Phases and their steps is described in the following
paragraphs.

Phase I. Establish Evaluation Methodology and Select Biomarkers for Consideration.

The following steps would appear in a representative embodiment of the subject invention. 

Step 1: Select a methodology for estimating the procedure's error rates.

The methodology may incorporate any statistically appropriate method of estimating the error 
rates. Two methods, of many that may be used, are: Training sample/validation sample, and 
subsampling (or "resampling"). 

Training Sample/Validation Sample Method. In the training sample/validation sample
approach, the test population is randomly divided into two subsets, identified herein as a
"training sample" and a "validation sample." Every subject (member of the test population) is
assigned to either the training sample or the validation sample. The data from subjects in the
training sample are used in the statistical analyses leading to specification of the discriminant
procedure and probability estimation procedure. The data from subjects in the validation
sample will be used to estimate the discriminant procedure's error rates and the distribution of
the probability estimates.

Subsampling Methods. "Subsampling" refers to a class of statistical methods, including
jackknifing and bootstrapping, that can be used to produce reduced-bias estimates of error
rates. In a subsampling method, data from all subjects are used in the statistical analyses
leading to specification of the discriminant procedure and/or distribution of probability
estimates. Utilizing all the data can lead to a better discriminant procedure and/or probability
estimation procedure than would be obtained in the Training Sample/Validation Sample
approach, especially: (1) if the test population is not large, or (2) if the a priori probability
of acquiring the biological condition is small, even with a large test population. In the
present context, subsampling methods are computationally intensive.
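
A leave-one-out (jackknife-style) estimate of the error rate may be sketched as follows. The classification rule used here, a single biomarker cut at the midpoint of the two group means, is a toy stand-in chosen for brevity and is not the discriminant procedure of the subject invention; all function names are hypothetical. Each group is assumed to contain at least two subjects so that a mean remains defined when one subject is held out.

```python
from statistics import mean
from typing import List, Tuple


def fit_threshold(data: List[Tuple[float, bool]]) -> Tuple[float, bool]:
    """Fit a toy rule: cut at the midpoint of the two group means.

    Returns (cut, d_is_high), where d_is_high says Group D has the higher mean.
    """
    d_mean = mean(x for x, in_d in data if in_d)
    other_mean = mean(x for x, in_d in data if not in_d)
    return (d_mean + other_mean) / 2, d_mean > other_mean


def predict(value: float, cut: float, d_is_high: bool) -> bool:
    """True means the subject is classified as projected to belong to Group D."""
    return value > cut if d_is_high else value < cut


def leave_one_out_error(data: List[Tuple[float, bool]]) -> float:
    """Refit with each subject held out in turn; return the share misclassified."""
    errors = 0
    for i, (value, in_d) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        cut, d_is_high = fit_threshold(rest)
        errors += predict(value, cut, d_is_high) != in_d
    return errors / len(data)
```

Because the rule is refitted once per subject, the cost grows with the size of the test population, which illustrates why such methods are computationally intensive.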

Step 2: Select the "training sample," i.e., the subset of the test population to be used for
statistical analyses leading to the discriminant procedure/probability estimation
procedure, and the "validation sample," which is the complementary subset.

If a subsampling method is to be used, data from all subjects are used in the statistical
analyses leading to specification of the discriminant procedure and/or distribution of
probability estimates. In this case, the "training sample" is the entire test population.

If the Training Sample/Validation Sample method is to be used, the training sample will
contain, approximately, a specified proportion of the test population. In many cases the 
training sample proportion will be 50%; however, other proportions may also be used. The 
validation sample will contain all subjects not included in the training sample. 

The random assignment of subjects to the training sample will typically be stratified on
subject age. Subject ages are classified into appropriate intervals; an age-group stratum
consists of all subjects whose age falls in the specific age interval. Intervals are selected so
that the number of subjects in each stratum is adequate for the statistical analyses. Within an
age-group stratum subjects will be randomly assigned to the training sample or validation
sample. The randomization is organized to achieve, approximately, the specified proportion
of subjects in the training sample. For example, if the training sample is specified to include
75% of the test population, approximately 75% of the subjects would be randomly assigned
to the training sample within each age-group stratum. For example, if "65 years ≤ age < 70
years" specifies one age-group stratum, approximately 75% of the subjects in this stratum
would be randomly assigned to the training sample.
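
The age-stratified random assignment may be sketched as follows. This is an illustration under stated assumptions: strata are taken to be equal-width age intervals, and the function name and parameters are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def stratified_split(ages: Dict[str, float], width: float, train_fraction: float,
                     seed: Optional[int] = None) -> Tuple[List[str], List[str]]:
    """Randomly assign subjects to training/validation within age-group strata.

    ages maps subject id -> age; strata are intervals [k*width, (k+1)*width).
    """
    rng = random.Random(seed)
    strata: Dict[int, List[str]] = defaultdict(list)
    for subject, age in sorted(ages.items()):
        strata[int(age // width)].append(subject)

    training: List[str] = []
    validation: List[str] = []
    for key in sorted(strata):
        members = strata[key]
        rng.shuffle(members)  # random order within the stratum
        n_train = round(len(members) * train_fraction)
        training.extend(members[:n_train])
        validation.extend(members[n_train:])
    return training, validation
```

With a width of 5 years and a training fraction of 0.75, a stratum of eight subjects aged 65 to under 70 contributes six subjects to the training sample and two to the validation sample.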

The validation sample, if any, consists of all test population subjects that are not in the 
training sample. 

Step 3: Compile a list of Potential Biomarkers that are potential discriminators.

The goal of this step is to compile a list of all reasonable, potentially useful biomarkers, which
will be called Potential Biomarkers. In a representative embodiment, the list of Potential
Biomarkers will include all recorded, quantitative, personal characteristics of subjects in the
test population. The list will include characteristics that do not change over time (e.g., date of
birth) as well as time-dependent characteristics, such as body weight or a lab assessment from
blood or urine. Non-quantitative characteristics, e.g., the name of the subject's favorite color,
will be excluded.

Some of the Potential Biomarkers listed in Step 3 will not be useful for discrimination. The 
remaining steps of this Phase compile a set of "Candidate Biomarkers" from the Step 3 list of 
Potential Biomarkers. Each Candidate Biomarker will be selected because there is 
information from previous research/knowledge, or quantitative evidence from the training 
sample data, that the biomarker is a potentially useful discriminator. At each step, a 
biomarker that is selected as a candidate is removed from the list of Potential Biomarkers and 
moved to the set of Candidate Biomarkers. A selected Candidate Biomarker is removed from 
the list of Potential Biomarkers because, once a biomarker has been selected as a 
candidate, there is no reason to reconsider it; it has already "made the list." At the end of the 
process, all unselected Potential Biomarkers will be removed from further consideration; only 
the Candidate Biomarkers will be subjected to additional analyses. 

Step 4: Initiate the set of Candidate Biomarkers by including any Potential Biomarkers that, 
on the basis of previous research and experience, are confidently believed to be 
related to the specified biological condition. 

The objective of this step is to utilize prior information on biomarkers that are potentially 
important discriminants for the specified biological condition. For example, if the specified 
biological condition is acquiring coronary heart disease (CHD) within a specified time, 
previous research has shown that values of serum cholesterol, systolic blood pressure, glucose 
intolerance, or cigarette smoking (to name just a few) are related to onset of CHD and should 
be moved from the list of Potential Biomarkers to the list of Candidate Biomarkers. 

Any reliable source of information or "educated guess" may be relied upon to select the subset 
of biomarkers known or believed to be related to the specified biological condition. Although 
the identity of the biomarkers initially selected is not critical to determining the identity of the 
subset that is ultimately selected for use in discrimination, the initial selection of biomarkers 
that are ultimately confirmed by this system as having the greatest statistical significance for 
predicting the specified biological condition will assist in providing more rapid convergence 
to the empirically determined subset. In other words, the more educated the initial selection, 
the more rapid the convergence. 

Step 5: Add to the list of Candidate Biomarkers any Potential Biomarkers that are 
"statistically significantly" correlated with the "known important" biomarkers from 
Step 4. 

Data from the training sample are used to compute a correlation coefficient between each 
previously identified Candidate Biomarker (these are the "known important" biomarkers) and 
each Potential Biomarker. Any statistically valid correlation coefficient may be used. 

The goal is to identify biomarkers that may be good discriminators. A correlate of a "known 
important" biomarker may be a better discriminator than the "known important" biomarker 
itself. At the least, correlates of known important biomarkers should be included in the initial 
analyses. 

If the specified biological condition is actually defined by values of one or more biomarkers, 
(e.g., hypertension), the defining biomarkers would be "known important" biomarkers and 
would have been moved to the list of Candidate Biomarkers in Step 4. Correlates of the 
defining biomarkers would be moved to the list of Candidate Biomarkers in this Step. 

"Statistical significance" is used here only as a tool for deciding between "probably 
important" and "probably unimportant" correlates. In a representative embodiment, a 
traditional p-value will be computed for a correlation between a Potential Biomarker and a 
Candidate Biomarker. If p is less than some specified value, e.g., p<0.05, or p<0.01, the 
Potential Biomarker is moved to the Candidate Biomarker list. 
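
A minimal sketch of this correlation screen follows. Since any statistically valid correlation coefficient may be used, the sketch assumes the Pearson coefficient and obtains an approximate two-sided p-value from Fisher's z-transform (a normal approximation chosen here so that only the Python standard library is needed). The function names, the dictionary layout, and the toy data are hypothetical.

```python
import math
from statistics import NormalDist

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_p_value(x, y):
    """Approximate two-sided p-value for H0: rho = 0 via Fisher's z-transform."""
    n = len(x)
    r = max(min(pearson_r(x, y), 0.999999), -0.999999)  # keep atanh finite
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

def promote_correlates(candidates, potentials, alpha=0.05):
    """Move each Potential Biomarker that is significantly correlated with any
    "known important" Candidate Biomarker into the candidate set.

    candidates, potentials: dicts mapping name -> list of values, same subject order.
    Returns the list of promoted names.
    """
    promoted = [name for name, values in potentials.items()
                if any(correlation_p_value(values, known) < alpha
                       for known in candidates.values())]
    for name in promoted:
        candidates[name] = potentials.pop(name)
    return promoted

# Toy training-sample data (hypothetical values)
candidates = {"cholesterol": [180, 195, 210, 225, 240, 255, 270, 285, 300, 315]}
potentials = {
    "ldl": [105, 118, 122, 138, 141, 155, 160, 172, 178, 190],
    "shoe_size": [9, 7, 10, 8, 9, 7, 10, 8, 9, 8],
}
promoted = promote_correlates(candidates, potentials, alpha=0.05)
```

In this toy data only the strongly correlated variable is promoted; the threshold `alpha` plays the role of the specified p-value cutoff.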

Step 6: Fit a logistic regression model for each Potential Biomarker, using a binary 
indicator variable for the specified biological condition as the dependent (Y) variable 
and age and the Potential Biomarker as the independent (X) variables. Add to the list 
of Candidate Biomarkers each Potential Biomarker that is "statistically significant" 
in its logistic regression model. 

The objective of this step is to select as Candidate Biomarkers those Potential Biomarkers 
that are related to the probability of acquiring the specified biological condition, after taking 
the (linear) effect of age into account. The logistic model expresses the probability of 
acquiring the specified biological condition as a function of the value of the Potential 
Biomarker, in conjunction with a subject's age. 

A biomarker is selected (or not) on the basis of a marginal p-value for the biomarker's slope 
in the logistic regression model. As with the correlations above, "statistical significance" is 
used here only as a tool for deciding between "probably important" and "probably 
unimportant" discriminators. In a representative embodiment, a traditional p-value will be 
computed for the slope of a Potential Biomarker. If p is less than some specified value, e.g., 
p<0.05, or p<0.01, the Potential Biomarker is moved to the Candidate Biomarker list. 
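
The screening rule of this step can be sketched as follows. This is an illustrative implementation only: it assumes NumPy is available, fits the logistic model by Newton-Raphson, and uses a Wald test (normal approximation) for the marginal p-value of the biomarker's slope. The simulated data, the age centering at 65, and all names are assumptions.

```python
import math
import numpy as np

def fit_logistic(X, y, n_iter=50, tol=1e-10):
    """Fit a logistic regression by Newton-Raphson.

    X: (n, p) design matrix that already contains an intercept column.
    y: (n,) array of 0/1 outcomes.
    Returns (coefficients, standard errors).
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-(X @ beta)))
        weights = prob * (1.0 - prob)
        information = X.T @ (X * weights[:, None])
        step = np.linalg.solve(information, X.T @ (y - prob))
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    # Recompute the information matrix at the final estimate for the standard errors
    prob = 1.0 / (1.0 + np.exp(-(X @ beta)))
    weights = prob * (1.0 - prob)
    information = X.T @ (X * weights[:, None])
    std_err = np.sqrt(np.diag(np.linalg.inv(information)))
    return beta, std_err

def wald_p_value(estimate, std_error):
    """Two-sided p-value from the normal approximation to the Wald statistic."""
    return math.erfc(abs(estimate / std_error) / math.sqrt(2.0))

# Simulated (hypothetical) training data with a genuinely informative biomarker
rng = np.random.default_rng(0)
n = 500
age = rng.uniform(50, 80, n)
biomarker = rng.normal(0.0, 1.0, n)
true_logit = -1.0 + 0.03 * (age - 65.0) + 1.2 * biomarker
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-true_logit))).astype(float)

# Independent variables: intercept, age (centered), and the Potential Biomarker
X = np.column_stack([np.ones(n), age - 65.0, biomarker])
beta, std_err = fit_logistic(X, y)
p_biomarker = wald_p_value(beta[2], std_err[2])
is_candidate = p_biomarker < 0.05   # move to the Candidate Biomarker list if True
```

Including age in the design matrix is what makes the biomarker's p-value "marginal" in the sense used above: it measures the biomarker's contribution after the linear effect of age is taken into account.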

Step 7: Evaluate each longitudinally-assessed Potential Biomarker, using a general linear 
mixed model ("MixMod") to assess whether longitudinal trends in the biomarker's 
values are related to acquisition of the specified biological condition. Each Potential 
Biomarker with a statistically significant longitudinal trend is moved to the list of 
Candidate Biomarkers. 

The goal of this step is to identify biomarkers, other than those previously promoted to 
Candidate Biomarker status, that have longitudinal trends that are related to the probability of 
acquiring the specified biological condition. 

In a typical embodiment of the subject invention, each model will be created as follows. The 
dependent variable (Y) in the MixMod contains longitudinal values of the Potential 
Biomarker. The independent (X) variables for fixed effects are: (1) a binary indicator variable 
for the specified biological condition, (2) age or another relevant longitudinal metameter such 
as time since some germane event, visit number, etc., and (3) the interaction between the 
binary indicator variable for the specified biological condition and the longitudinal 
metameter. The random effects part of the model includes a random subject increment to the 
intercept of the population regression line and, in some cases, a random slope with respect to 
the longitudinal metameter. When two or more random effects are included, the covariance 
matrix of the random effects is typically unstructured. Age or another relevant longitudinal 
metameter is included in the model for the same reasons as in Step 6. 

If the coefficient corresponding to any X-variable other than age is statistically significant, the 
Potential Biomarker is moved to the list of Candidate Biomarkers. The remarks on statistical 
significance in Step 6 are applicable here. 

At the end of Steps 4-7, all Potential Biomarkers have been examined and each biomarker 
with historical or quantitative evidence of utility as a discriminator has been moved to the list 
of Candidate Biomarkers. 

Phase II. Reduce the Candidate Biomarkers to a Set of Select Biomarkers that have 
Discriminatory Power and Perform Mixed Model Estimation of the Covariance 
Structure and Predicted Values. 

Background. Prior art discriminant analysis methodology typically requires relatively precise 
estimates of the mean vectors, $\mu_i$, and covariance matrices, $\Sigma_i$, of the distributions of the 
biomarkers (and other variables, such as age and demographics) of the two groups, Group D 
(i=1) and Group D̄ (i=2). The $\mu_i$ are estimated as simple sample means (vectors) and the $\Sigma_i$ 
are estimated as simple sample covariance matrices, which do not permit adjustment of the 
mean for important concomitant variables (or "covariates") and do not readily include 
repeated measures from the same subject. Moreover, prior art discriminant analysis is 
typically based upon a "casewise deletion" procedure: if a subject has any missing data, all of 
that subject's data are deleted from the analyses. 

Given estimates of the mean vectors, $\mu_i$, and covariance matrices, $\Sigma_i$, and the biomarker 
(and related data) for a subject in a vector, Y, the traditional discriminant functions (linear if 
$\Sigma_1 = \Sigma_2$, quadratic if $\Sigma_1 \ne \Sigma_2$) are evaluated solely from Y, $\mu_1$, $\mu_2$, $\Sigma_1$, and $\Sigma_2$. The only 
information specific to the particular subject is in the vector Y. 

The mixed model procedure, which is the greater part of Phase II, improves the traditional 
procedure by using a general linear mixed model (MixMod) to model all of $\mu_1$, $\mu_2$, $\Sigma_1$, and 
$\Sigma_2$; the modeled estimates of these parameters are used in the discriminant function rather 
than the traditional simple, unmodeled estimates. This MixMod procedure makes the 
following important improvements over traditional discriminant analysis: 

• The parameters are estimated using a mixed model that: 

♦ uses all available data, i.e., does not use casewise deletion; 

♦ supports covariate adjustment of the estimated expected values ($\mu_i$), with 
corresponding adjustment of the estimated covariance matrices $\Sigma_i$; and 

♦ supports the utilization of repeated measures (e.g., from annual visits) from the 
same subject. 

• This MixMod procedure utilizes model-based estimates of individual random effects 
and "BLUPs" ("Best Linear Unbiased Predictors"), in addition to or in place of the 
estimates of the population means $\mu_i$, which can substantially increase the 
discrimination capability of the discriminant function. 

Overview of the Phase II Procedure 

As a result of Phase I, each Candidate Biomarker will have historical or quantitative evidence 
of utility as a discriminator. However, there are substantial correlations among the Candidate 
Biomarkers. Consequently, a biomarker that, considered by itself, has substantial 
discriminatory power, may not make a substantial contribution when used in combination 
with other biomarkers. In addition, the scales of the biomarkers may vary widely. 

The objectives of Phase II of the subject procedure are to: 

(1) Rescale the biomarker values so that standard deviations of all rescaled biomarkers are on 
the same order of magnitude (0 < standard deviation ≤ 1). 

(2) Reduce the possibly long list of Candidate Biomarkers to a smaller number of "Select 
Biomarkers," each of which contributes substantially to the discriminatory power of 
the set. 

(3) Determine the structure of the expected value of the vector Y of (rescaled) biomarker 
values using a linear model of the form $E[Y] = X\beta$, and estimate $\beta$, a vector of 
unknown parameters. 

(4) Determine the structure of the covariance matrix of the vector Y of (rescaled) biomarker 
values using a model of the form $\Sigma = Z \Delta Z' + V$, and estimate the unknown 
covariance parameters in the matrices $\Delta$ and V. 

(5) Estimate the random subject effect vector, $d_{ik}$, and compute the predicted-value 
vector, $\hat{Y}_{ik}^{(p)}$, of the k-th subject, as if that subject came from the i-th specified 
biological condition group; i=1 corresponds to Group D and i=2 corresponds to 
Group D̄. 

In a representative embodiment of the subject invention, Step 1 of this Phase is performed 
once in order to rescale the biomarker data and arrange the data into one data vector (or one 
variable in a dataset). Steps 2 and 3 are performed iteratively until the set of Select 
Biomarkers has been selected and the estimates listed above have been computed. Step 4 
refines the mixed model and parameter estimates to be used in the discrimination by selecting 
appropriate models for the covariance matrices. 

Step 1: Prepare a dataset in which one variable, "RespScal," contains scaled values 
(including longitudinal measures) of all Candidate Biomarkers from all subjects. 

The scaling is performed separately for each biomarker. Each biomarker value is divided by 
the sample standard deviation of that biomarker. Thus, the standard deviation of the scaled 
values of each biomarker is 1.00. In a representative embodiment of the subject invention, 
the one variable of biomarker values may be named "RespScal" (an abbreviation of 
"Response - Scaled"). The sample standard deviation of RespScal is also approximately 
1.00. This scaling facilitates convergence of the iterative procedure in subsequent mixed 
model computations. 
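
The per-biomarker scaling can be sketched in a few lines of Python. The long-format record layout (one dictionary per subject, biomarker, and visit) and the function name are assumptions for illustration; only the rule "divide each value by the sample standard deviation of its biomarker" comes from the description above.

```python
from statistics import stdev

def build_respscal(records):
    """Add a 'RespScal' field: value divided by the sample SD of its biomarker.

    records: list of dicts with keys 'subject', 'biomarker', 'visit', 'value'
    (hypothetical long-format layout, one row per measurement).
    """
    by_biomarker = {}
    for rec in records:
        by_biomarker.setdefault(rec["biomarker"], []).append(rec["value"])
    # Sample standard deviation, computed separately for each biomarker
    sd = {name: stdev(values) for name, values in by_biomarker.items()}
    return [dict(rec, RespScal=rec["value"] / sd[rec["biomarker"]]) for rec in records]

# Hypothetical data: two subjects, two visits, two biomarkers on very different scales
records = [
    {"subject": 1, "biomarker": "cholesterol", "visit": 1, "value": 180.0},
    {"subject": 1, "biomarker": "cholesterol", "visit": 2, "value": 200.0},
    {"subject": 2, "biomarker": "cholesterol", "visit": 1, "value": 220.0},
    {"subject": 2, "biomarker": "cholesterol", "visit": 2, "value": 240.0},
    {"subject": 1, "biomarker": "weight", "visit": 1, "value": 60.0},
    {"subject": 1, "biomarker": "weight", "visit": 2, "value": 70.0},
    {"subject": 2, "biomarker": "weight", "visit": 1, "value": 80.0},
    {"subject": 2, "biomarker": "weight", "visit": 2, "value": 90.0},
]
scaled = build_respscal(records)
```

After scaling, every biomarker has unit sample standard deviation, so all biomarkers can share the single RespScal column without one of them dominating the mixed model fit.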



Step 1 is executed only once. Initially, all Candidate Biomarkers have data in RespScal and 
are considered members of the set of Select Biomarkers. Non-discriminating biomarkers will 
be removed from the Select Biomarkers in Steps 2-3. 

Step 2: Fit a general linear mixed model (MixMod) using the specifications listed below; 
obtain estimates of the parameter matrices $\beta$, $\Delta$, and V; obtain estimates of each 
subject's random subject effects, $d_{ik}$, and each subject's "predicted values," $\hat{Y}_{ik}^{(min)}$ 
and $\hat{Y}_{ik}^{(avg)}$, as if the subject were in each specified biological condition group, i = 1, 2. 

In a representative embodiment of the subject invention, the following are specifications of 
the MixMod: 

Dependent (Y) variable: RespScal; 

Independent (X) variables and their coefficients ($\beta$): 

    "Biological Condition Status," an indicator variable for the status of the 
        specified biological condition (classification variable); Biological 
        Condition Status = 1 if the corresponding element of Y contains 
        information about a subject from Group D and Biological Condition 
        Status = 0 otherwise. 
    Biomarkers' indicator variables (classification variables); 
    Biological Condition Status × Biomarkers' indicator variables (classification 
        variables); 
    Age (in years, centered at approximately the overall mean age of subjects; 
        continuous variable); 

Random effects variables ($Z_k$) and random coefficients (effects, $d_{ik}$): 

    Subject × Biomarker indicator variables (part of $Z_k$) and corresponding 
        random effects (intercept increments; part of $d_{ik}$). 
    The random subject effect for a specific biomarker is constant across 
        that subject's multiple visits, which generates correlations 
        among repeated measurements of that biomarker for that 
        subject. 
    Note that the model assumes $E[d_{ik}] = 0$ and $V[d_{ik}] = \Delta$. 

Covariance matrix, $V_k = V(e_k)$, of the vector $e_k$ of biomarker random error terms, $e_{kbv}$, 
    for the k-th subject at the v-th longitudinal evaluation of the b-th scaled 
    Candidate Biomarker. This covariance matrix has one row and one column for 
    each longitudinal evaluation of each biomarker for the k-th subject. Note that 
    the model also assumes $E[e_k] = 0$. 

The primary interpretation of $e_{kbv}$ is as a "random measurement error term," 
representing variation, from one evaluation to another, of a value of the 
scaled Candidate Biomarker about subject k's age-dependent mean 
value for that scaled Candidate Biomarker. With this interpretation, it 
is often reasonable to assume that values of $e_{kbv}$ are homoscedastic and 
are uncorrelated, i.e., $Cov(e_{kbv}, e_{k'b'v'}) = 0$ if $(k,b,v) \ne (k',b',v')$. If the 
elements of Y are sorted by k (subject ID), b (biomarker ID), and v 
("visit" or evaluation number or age of subject), then a reasonable 
model for $V_k$ in many cases is $V_k = BlockDiag(V_{kb}) = BlockDiag(V_{k1}, V_{k2}, \ldots)$, 
where $V_{kb} = \lambda_b I$ and $\lambda_b = V(e_{kbv})$, the variance of measurement 
errors for scaled values of the b-th Candidate Biomarker, which 
variance is assumed to be the same for all subjects (k) and all 
evaluations (v). 

Note that the scaling of RespScal implies that each variance, $\lambda_b$, will be less 
than 1.00. The extent to which the variance is less than 1.00 depends 
upon the magnitudes of the fixed effects (a high $R^2$ leads to a smaller 
estimated variance) and the magnitudes of the variances of the random 
effects (diagonal elements of $\Delta$). 

Note that the above combination of $Z_k$, $d_k$, $V_k = BlockDiag(V_{kb})$ and $V_{kb} = \lambda_b I$ 
generates a highly structured, extended compound symmetric model for 
$\Sigma_{ik}$. To illustrate the point in an example when the same covariance 
parameters apply to both Group D and Group D̄, let $d_k = [d_{kb}] = [d_{k1}, d_{k2}, \ldots]'$ 
be the vector of random effects for the k-th subject and b-th 
scaled biomarker, let $V(d_k) = \Delta = [\delta_{bb'}]$, where $\delta_{bb'} = Cov(d_{kb}, d_{kb'})$ 
and b and b' index possibly different scaled biomarkers, let $Z_k$ 
contain indicator variables for the scaled biomarkers, and let $V_{kb} = \lambda_b I$. 
Then $\Sigma_k = Z_k \Delta Z_k' + V_k = [\Sigma_{k,bb'}]$, where $\Sigma_{k,bb} = \delta_{bb} J + \lambda_b I$ = 
covariance matrix of multiple measurements from scaled biomarker b, 
and $\Sigma_{k,bb'} = \delta_{bb'} J$ = covariance of scaled biomarkers b and b' evaluated 
on the same occasion or on different occasions. (Each element of the 
square matrix J equals 1.) 
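
The extended compound symmetric structure can be made concrete with a small numerical sketch. It assumes NumPy, writes the random-effects covariance matrix as `delta` and the measurement-error variances as `lam`, and uses made-up parameter values; the function name and inputs are hypothetical.

```python
import numpy as np

def subject_covariance(n_visits, delta, lam):
    """Build Sigma_k = Z_k Delta Z_k' + V_k for one subject.

    n_visits: number of evaluations of each scaled biomarker b (rows sorted by b, then visit)
    delta:    (B, B) covariance matrix of the random subject effects (intercept increments)
    lam:      length-B measurement-error variances, one per biomarker
    """
    n_biomarkers = len(n_visits)
    # Z_k holds one indicator column per biomarker
    Z = np.zeros((sum(n_visits), n_biomarkers))
    row = 0
    for b, n in enumerate(n_visits):
        Z[row:row + n, b] = 1.0
        row += n
    # V_k = BlockDiag(lam_b * I): uncorrelated, homoscedastic errors within a biomarker
    V = np.diag(np.repeat(lam, n_visits))
    return Z @ np.asarray(delta) @ Z.T + V

# Hypothetical subject: biomarker 0 measured at 3 visits, biomarker 1 at 2 visits
delta = [[0.5, 0.2],
         [0.2, 0.4]]
lam = [0.3, 0.5]
sigma_k = subject_covariance([3, 2], delta, lam)
```

In the resulting 5 x 5 matrix, the diagonal block for biomarker b has the form delta[b][b] * J + lam[b] * I, and the off-diagonal block is delta[0][1] * J, which is the pattern described in the text.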



The process of fitting the mixed model produces estimates of: 

The model's parameters, $\beta$, $\Delta$, and parameters of $V_k$. If the model assumes different 
    covariances for the two Biological Condition Status groups, the model 
    produces separate estimates of the covariance parameters in $\Delta_i$ and $V_{ik}$. 
The expected value of each subject's data vector, $\mu_{ik}$ (subject k being in Biological 
    Condition Status group i). 
The expected value of each subject's data vector, $\mu_{i'k}$, as if the subject were in the 
    other response group (i'). 
Each subject's random subject effect in the subject's actual treatment group (i), $d_{ik}$, 
    and also as if the subject were in the other response group (i'), $d_{i'k}$. 
Each subject's "predicted values," in the subject's actual treatment group (i): $\hat{Y}_{ik}^{(p)}$, and 
    also as if the subject were in the other response group (i'): $\hat{Y}_{i'k}^{(p)}$. 
The subject's covariance matrix, $\Sigma_k$. If the model assumes different covariances for 
    the two Biological Condition Status groups, the model produces separate 
    estimates of the covariance matrices $\Sigma_{ik}$. 

Step 3: Delete the biomarker that has the least apparent discriminant power and re-fit the 
mixed model. 



A biomarker that will be an effective discriminant should have a large (statistically 
significant) Biological Condition Status × Biomarker fixed effect. In contrast, a large 
Biomarker main effect is not relevant here: a large Biomarker main effect (indicating 
differences among biomarker means) can arise simply because the biomarkers are different 
types of variables and have different means (on the rescaled axis). A large 
Biological Condition Status × Biomarker effect, however, indicates that the biomarker mean for 
Biological Condition Status = 0 (Group D̄) is significantly different from the mean for 
Biological Condition Status = 1 (Group D) for the same biomarker. Such a 
difference should make an important contribution to the discrimination procedure. 

If each current Select Biomarker has a statistically significant Biological Condition Status 
× Biomarker fixed effect, Step 3 is completed and we move to Step 4. If one or more current 
Select Biomarkers has a not-statistically-significant Biological Condition Status × Biomarker 
fixed effect, the biomarker with the least statistically significant (largest p-value) Biological 
Condition Status × Biomarker fixed effect is removed from the data vector, Y, and we return 
to Step 2 where a MixMod is fitted to the reduced data vector. 

The strategy being implemented in Step 3 is an analog of a "backwards elimination" 
procedure in the stepwise regression context. An alternative is to implement an analog of 
"forward selection," in which one initially includes only a very small number of clearly 
effective discriminants (biomarkers) in the data vector and model and, at each subsequent 
step, adds one more biomarker. 
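
The Step 2 / Step 3 iteration can be sketched as a simple loop. The mixed model fit itself is represented by a caller-supplied function, `fit_interaction_p_values`, which is hypothetical: it stands in for refitting the MixMod and returning the p-value of each biomarker's Biological Condition Status × Biomarker fixed effect. The toy stand-in below returns fixed p-values purely to make the example runnable; in a real fit the p-values would change after each deletion.

```python
def backward_eliminate(biomarkers, fit_interaction_p_values, alpha=0.05):
    """Backward elimination over the Select Biomarkers.

    biomarkers: initial list of Candidate Biomarker names.
    fit_interaction_p_values: callable taking the current list of names and
        returning {name: p-value of its Status x Biomarker fixed effect}
        (hypothetical interface standing in for the Step 2 MixMod fit).
    """
    selected = list(biomarkers)
    while selected:
        p_values = fit_interaction_p_values(selected)      # Step 2: (re)fit the model
        worst = max(selected, key=lambda name: p_values[name])
        if p_values[worst] < alpha:
            break                                          # all significant: proceed to Step 4
        selected.remove(worst)                             # Step 3: drop the weakest biomarker
    return selected

def toy_fit(names):
    """Stand-in for a mixed model fit; returns made-up, fixed p-values."""
    table = {"cholesterol": 0.001, "sbp": 0.02, "weight": 0.40, "height": 0.75}
    return {name: table[name] for name in names}

select_biomarkers = backward_eliminate(
    ["cholesterol", "sbp", "weight", "height"], toy_fit, alpha=0.05
)
```

Only one biomarker is removed per pass, so the model is refitted after every deletion, as the description requires.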

Step 4: Determine the structures of the covariance parameter matrices, $\Delta_i$ and $V_{ik}$. 

Discriminant analysis methodology uses both the expected values of the biomarkers and the 
covariance matrices of the biomarkers (some of which may be evaluated longitudinally) 
separately for each Biological Condition Status group, D and D̄. Recall that the list of Select 
Biomarkers, including possible longitudinal assessments, already will have been finalized in 
Step 3. As noted above, a MixMod incorporates assumptions that lead to the following 
structure for the covariance matrices: $\Sigma_{ik} = Z_{ik} \Delta_i Z_{ik}' + V_{ik}$, where i indexes Biological 
Condition Status group (i=1 for Group D, i=2 for Group D̄) and k indexes subjects. In 
addition, the covariance parameter matrices $\Delta_i$ and $V_{ik}$ may have structure that can be 
exploited in the analysis, especially when $\Sigma_{ik}$ is very large, i.e., when there are many 
biomarkers and/or many longitudinal assessments of one or more biomarkers. 

The objective of Step 4 is to determine the structure of the covariance parameter matrices $\Delta_i$ 
and $V_{ik}$ for use in the Phase III discriminant analyses. Estimates of large, structured 
covariance parameter matrices tend to be more precise than estimates of unstructured 
covariance parameter matrices. A more precise estimate of $\Delta_i$ and/or $V_{ik}$ leads to a more 
precise estimate of $\Sigma_{ik} = Z_{ik} \Delta_i Z_{ik}' + V_{ik}$, thence to more precise estimates of $\beta$, the $d_{ik}$, and 
the $\hat{Y}_{ik}^{(p)}$, and to more precise values of the discriminant function. 

The overall structure of $\Sigma_{ik}$ must take into account the following types of covariances/ 
correlations: 

Type ADB: Covariances/correlations among different biomarkers evaluated at the 
same time point; 

Type ALESB: Covariances/correlations among longitudinal evaluations of a single 
biomarker; 

Type BTBEL: Covariances/correlations between two biomarkers, evaluated 
longitudinally, i.e., covariances/correlations between any pair of biomarkers, 
one evaluated at one time and the other evaluated at a different time. 

In a representative embodiment of the subject invention, the structures described in Step 2, 
above, or extensions of these structures may be useful. 

In a representative embodiment of the subject invention, the techniques described in Tangen, 
Catherine M., and Helms, Ronald W., (1996), "A case study of the analysis of multivariate 
longitudinal data using mixed (random effects) models," presented at the 1996 Spring 
Meeting of the International Biometric Society, Eastern North American Region, Richmond, 
Virginia, March, 1996, are used to explore covariance/correlation structures for longitudinal 
multivariate data. Selecting a covariance model typically requires fitting a number of 
MixMods, typically using the same expected-value model and varying the covariance model. 
Models may be compared via Log Likelihood statistics (assuming underlying normal 
distributions). Covariance structures may also be compared graphically using techniques 
developed by Ronald W. Helms at the University of North Carolina, e.g., Grady, J. J. and 
Helms, R. W. (1995), "Model Selection Techniques for the Covariance Matrix for Incomplete 
Longitudinal Data," Statistics in Medicine, 14, 1397-1416. 

Phase III: Calculate Discriminant Functions Using Estimated Means and Predicted 
Values and Compute Logistic Predicted Values for each Subject; Estimate Error 
Rates for the Discriminant Functions 

Background. The objective of Phase III is to "predict" which "population" or group a subject 
will belong to, Group D or Group D̄: 

• Group D: That subpopulation of persons who will acquire the specified biological 
condition within a specified timeframe. 

• Group D̄: That subpopulation of persons who will not acquire the specified biological 
condition within the specified timeframe. 

A subject is classified by placing the subject into one of the following two groups: 

• Group PD: That group of persons who, at the beginning of the specified timeframe, 
are predicted to acquire the specified biological condition within the specified 
timeframe, i.e., projected to belong to Group D. These persons are described as 
having a prescribed high probability of acquiring the specified biological condition 
within the specified timeframe. 

• Group PD̄: That group of persons who, at the beginning of the specified timeframe, 
are predicted not to acquire the specified biological condition within the specified 
timeframe, i.e., projected to belong to Group D̄. These persons are described as 
having a prescribed low probability of acquiring the specified biological condition 
within a specified timeframe. 

A second objective is to estimate the probabilities that a subject will belong to Groups D and 
D̄. 

The technology for achieving the first objective (classifying a subject into one of the two 
groups) uses discriminant procedures that are modifications of traditional discriminant 
analysis. The estimate of the probability that the subject will be in the group of subjects that 
will acquire the specified biological condition is obtained from a modification of traditional 
logistic regression, (1) using the discriminant function values as regressors and (2) using the 
discriminant variables as regressors. 


As noted in the background of Phase II, prior art discriminant analysis methodology typically 
utilizes naive estimates of the mean vectors, $\mu_i$, and covariance matrices, $\Sigma_i$, of the 
distributions of the biomarkers of the two groups. Moreover, prior art discriminant analysis 
is typically based upon a "casewise deletion" procedure: if a subject has any missing data, all 
of that subject's data are deleted from the analyses. 

The mixed model procedure, described in Phase II, improves the traditional procedure by 
using a general linear mixed model (MixMod) to model all of the $\mu_i$ and $\Sigma_i$; the modeled 
estimates of these parameters are used in the discriminant function rather than the traditional 
simple, unmodeled estimates. The use of the mixed model permits the present procedures to 
make the following important improvements over traditional discriminant analysis. The 
parameters are estimated using all available data, i.e., the procedure does not use casewise 
deletion. The procedure supports covariate adjustment of the estimated expected values ($\mu_i$), with 
corresponding adjustment of the estimated covariance matrices $\Sigma_i$. And the procedure 
supports the utilization of repeated measures (e.g., from annual visits) from the same subject. 

Perhaps more importantly, the use of the mixed model permits the present procedures to 
utilize model-based estimates of individual random effects and "BLUPs" ("Best Linear 
Unbiased Predictors"), in addition to or in place of the estimates of the population means 
$\mu_i$, which can substantially increase the discrimination capability of the discriminant function. 

The present discriminants are formally identical in form to the traditional discriminants 
based upon multivariate normality. Some notation is useful; let: 

$f_i$ denote the density function of the distribution of the vector Y of discriminant 
variables for a subject from group i, evaluated using "estimates" of $\mu_i$ and $\Sigma_i$, i 
= 1 for Group D or Group PD, i=2 for Group D̄ or Group PD̄; 

$p_i$ denote the a priori probability that a subject will come from group i, i = 1 for Group 
D, i=2 for Group D̄. The values of the $p_i$ are often known from historical data 
or other research. If the values of the $p_i$ are unknown, the proportions of the 
subjects in the two groups may be used as estimates of the $p_i$. 

Then a subject of unknown group with vector Y of discriminant function values would be 
classified into group 1 (Group PD) if $\ln[f_1(Y)/f_2(Y)] > \ln[p_2/p_1]$ and would be assigned to 
group 2 (Group PD̄) otherwise. 

In Phase II one will have decided whether one can reasonably assume the two groups have 
equal covariance matrices, $\Sigma_1 = \Sigma_2 = \Sigma$. In that case, the present discriminant procedure 
reduces to use of a linear discriminant function of the following form: 

$D(Y) = [Y - \tfrac{1}{2}(\mu_1 + \mu_2)]' \Sigma^{-1} (\mu_1 - \mu_2) - \ln[p_2/p_1]$ 

where the $\mu_i$ and $\Sigma$ are replaced by "appropriate" estimates to be discussed below. One 
compares D(Y) vs. 0. If, in Phase II, it was decided that $\Sigma_1 \ne \Sigma_2$, the discriminant procedure 
reduces to use of a quadratic discriminant function of the following form: 

$Q(Y) = \tfrac{1}{2} \ln(|\Sigma_2| / |\Sigma_1|) - \tfrac{1}{2} (Y-\mu_1)' \Sigma_1^{-1} (Y-\mu_1) + \tfrac{1}{2} (Y-\mu_2)' \Sigma_2^{-1} (Y-\mu_2) - \ln[p_2/p_1]$ 

where the $\mu_i$ and $\Sigma_i$ are replaced by "appropriate" estimates to be discussed below. One 
compares Q(Y) vs. 0. 
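
As an illustration, the two discriminant functions can be evaluated as in the following sketch. It assumes NumPy and uses the standard multivariate-normal forms, D(Y) = [Y - (mu1 + mu2)/2]' Sigma^{-1} (mu1 - mu2) - ln(p2/p1) and Q(Y) = (1/2) ln(|Sigma2|/|Sigma1|) - (1/2)(Y - mu1)' Sigma1^{-1} (Y - mu1) + (1/2)(Y - mu2)' Sigma2^{-1} (Y - mu2) - ln(p2/p1). The function names, the group labels returned by `classify`, and the numerical inputs are hypothetical.

```python
import math
import numpy as np

def linear_discriminant(y, mu1, mu2, sigma, p1, p2):
    """D(Y) for a common covariance matrix sigma; positive values favor group 1."""
    w = np.linalg.solve(sigma, mu1 - mu2)
    return float((y - 0.5 * (mu1 + mu2)) @ w - math.log(p2 / p1))

def quadratic_discriminant(y, mu1, mu2, sigma1, sigma2, p1, p2):
    """Q(Y) for group-specific covariance matrices; positive values favor group 1."""
    _, logdet1 = np.linalg.slogdet(sigma1)
    _, logdet2 = np.linalg.slogdet(sigma2)
    q1 = (y - mu1) @ np.linalg.solve(sigma1, y - mu1)
    q2 = (y - mu2) @ np.linalg.solve(sigma2, y - mu2)
    return float(0.5 * (logdet2 - logdet1) - 0.5 * q1 + 0.5 * q2 - math.log(p2 / p1))

def classify(score):
    """Compare the discriminant value with 0 (labels are illustrative)."""
    return "PD" if score > 0 else "not-PD"

# Hypothetical two-biomarker example with equal prior probabilities
mu1 = np.array([1.0, 1.0])   # group 1 mean
mu2 = np.array([0.0, 0.0])   # group 2 mean
sigma = np.eye(2)
y_new = np.array([1.0, 1.0])

d_value = linear_discriminant(y_new, mu1, mu2, sigma, 0.5, 0.5)
q_value = quadratic_discriminant(y_new, mu1, mu2, sigma, sigma, 0.5, 0.5)
label = classify(d_value)
```

When the two covariance matrices are equal, the log-determinant term vanishes and Q(Y) reduces to D(Y), which the example reproduces numerically. In the procedure described here, the means and covariance matrices would come from the Phase II mixed model rather than from fixed numbers.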

In either case, the "appropriate" estimates come from the mixed model procedure in Phase II 
and may or may not include random subject effects. 

Phase III Procedure. The steps of Phase III of the procedure are described below. It is 
assumed that data are available from one or more "new" subjects, i.e., subjects whose group 
membership is unknown and that were not used in the Phase II mixed model computations. 
In Steps 1-2 we shall consider one subject at a time. Some additional notation is useful. Let i 
= 1 for Group D or Group PD, i=2 for Group D̄ or Group PD̄, and let: 



Y denote the vector of values of the discriminant variables for one new subject. The 
    elements of Y are scaled as RespScal was scaled in Phase II. 
$X_i$ denote the matrix of values of the independent variables used in the final Phase II 
    mixed model, as if the subject were in group i, i = 1, 2. Note that the rows of 
    $X_i$ correspond to the rows (elements) of Y. 
$Z_i$ denote the matrix of values of the random effect variables used in the final Phase II 
    mixed model, as if the subject were in group i, i = 1, 2. Note that the rows of 
    $Z_i$ correspond to the rows of Y. 
$\hat{\Delta}_i$ denote the estimated covariance matrix of the random effects from group i, i = 1, 
    2, from the final Phase II mixed model. Note that in many cases the mixed 
    model reduced to a single covariance for the random effects, i.e., $\hat{\Delta}_1 = \hat{\Delta}_2 = \hat{\Delta}$. 
$\hat{V}_i$ denote the estimated covariance matrix of the random residuals or "error terms" 
    from group i, i = 1, 2, from the final Phase II mixed model. Note that in many 
    cases the mixed model reduced to a single covariance matrix, i.e., $\hat{V}_1 = \hat{V}_2 = \hat{V}$. 
$\hat{\Sigma}_i = Z_i \hat{\Delta}_i Z_i' + \hat{V}_i$ denote the estimated covariance matrix of Y, from the final Phase 
    II mixed model, as if the new subject came from group i, i = 1, 2. Note that in 
    many cases the mixed model reduced to a single covariance matrix, i.e., $\hat{\Sigma}_1 = \hat{\Sigma}_2 = \hat{\Sigma}$. 

Step 1: Using results from the Phase II mixed model, classify all subjects in the validation 
sample and estimate the error rates of multiple candidate discriminant procedures, 
one based on "estimated values," and others based upon "predicted values" utilizing 
various combinations of the estimated random subject effects. The procedure with the 
lowest estimated error rate is the selected procedure and is referred to as the "apparently 
most reliable procedure." 

If the original study population was divided into a "training sample" and a "validation 
sample," use the validation sample in the following; otherwise use the training sample as the 
"validation sample." Estimate the following quantities for each subject in the validation 
sample, separately, as if the subject came from each group. 

$\hat{\mu}_i = X_i \hat{\beta}$, the "estimated value" of Y, as if the subject came from group i, i = 1, 2.

$\hat{d}_i = \hat{\Delta}_i Z_i' \hat{\Sigma}_i^{-1} (Y - X_i \hat{\beta})$, the estimate of the subject's random subject effect, as if the
subject came from group i, i = 1, 2.

$\hat{d}_{min} = \hat{d}_1$ if $\hat{d}_1' \hat{\Delta}_1^{-1} \hat{d}_1 \le \hat{d}_2' \hat{\Delta}_2^{-1} \hat{d}_2$; otherwise $\hat{d}_{min} = \hat{d}_2$. $\hat{d}_{min}$ may be thought of as the
"minimum" of $\hat{d}_1$ and $\hat{d}_2$, or the "minimum (over groups) random subject
effect" estimate.

$\hat{d}_{avg} = (\hat{d}_1 + \hat{d}_2)/2$. $\hat{d}_{avg}$ may be thought of as the "average" of $\hat{d}_1$ and $\hat{d}_2$, or the
"average (over groups) random subject effect" estimate.

$\hat{Y}_i^{(min)} = X_i \hat{\beta} + Z_i \hat{d}_{min}$, the subject's "predicted values," as if the subject came from
group i, i = 1, 2, but using the "minimum" random subject effect estimate.

$\hat{Y}_i^{(avg)} = X_i \hat{\beta} + Z_i \hat{d}_{avg}$, the subject's "predicted values," as if the subject came from
group i, i = 1, 2, but using the "average" random subject effect estimate.

In the above and below, i = 1 for Group D or Group PD, i = 2 for Group D̄ or Group PD̄.
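
The random subject effect estimates and predicted values defined above can be illustrated with a minimal sketch for a deliberately simplified case. It assumes a single random intercept per subject (Z is a column of ones), a scalar random-effect variance `delta`, and independent residuals with variance `sigma2`; none of these simplifications come from the specification. It also assumes that the "minimum" rule compares the quadratic forms d'Δ⁻¹d of the two group-specific estimates. All function names and numeric values are hypothetical.

```python
def blup_random_intercept(y, mu, delta, sigma2):
    """Random-intercept estimate for one subject.

    With Z = column of ones, Delta = delta (scalar) and V = sigma2 * I,
    Sigma = delta * J + sigma2 * I, and Delta Z' Sigma^{-1} (y - mu)
    reduces to delta * sum(y - mu) / (sigma2 + n * delta).
    """
    n = len(y)
    resid_sum = sum(yi - mi for yi, mi in zip(y, mu))
    return delta * resid_sum / (sigma2 + n * delta)


def min_and_avg_effects(d1, d2, delta1, delta2):
    """Return the 'minimum' and 'average' random subject effect estimates."""
    # Assumed reading of the 'minimum' rule: smaller quadratic form d' Delta^{-1} d.
    d_min = d1 if d1 * d1 / delta1 <= d2 * d2 / delta2 else d2
    d_avg = (d1 + d2) / 2.0
    return d_min, d_avg


def predicted_values(mu, d):
    """Predicted values X_i beta + Z_i d for a random-intercept model."""
    return [m + d for m in mu]


# Hypothetical scaled observations for one subject and group-specific means.
y = [1.2, 0.8, 1.0]
mu1 = [0.5, 0.5, 0.5]   # X_1 beta-hat (as if the subject were in group 1)
mu2 = [1.1, 1.1, 1.1]   # X_2 beta-hat (as if the subject were in group 2)
delta, sigma2 = 0.5, 0.25

d1 = blup_random_intercept(y, mu1, delta, sigma2)
d2 = blup_random_intercept(y, mu2, delta, sigma2)
d_min, d_avg = min_and_avg_effects(d1, d2, delta, delta)
y_min_1 = predicted_values(mu1, d_min)
y_avg_1 = predicted_values(mu1, d_avg)
```

In this toy case the group 2 estimate is closer to zero, so it is taken as the "minimum" effect; the general multivariate case requires the full matrices $Z_i$, $\hat{\Delta}_i$ and $\hat{\Sigma}_i$.
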
Classification based upon the estimated values, $\hat{\mu}_i$:

• If the decision $\Sigma_1 = \Sigma_2 = \Sigma$ was made in Phase II, evaluate the linear discriminant
function, D(Y) (above), substituting $\hat{\mu}_i$ for $\mu_i$ and $\hat{\Sigma}$ for $\Sigma$. Assign the subject to
group 1 (Group PD) if D(Y) ≥ 0; otherwise assign the subject to group 2 (Group PD̄).

• If the decision $\Sigma_1 \ne \Sigma_2$ was made in Phase II, evaluate the quadratic discriminant
function, Q(Y) (above), substituting $\hat{\mu}_i$ for $\mu_i$ and $\hat{\Sigma}_i$ for $\Sigma_i$, i = 1, 2. Assign the
subject to group 1 (Group PD) if Q(Y) ≥ 0; otherwise assign the subject to group 2
(Group PD̄).
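
The equal-covariance classification rule can be sketched as follows. This is an illustration only: it assumes the textbook two-group linear discriminant with equal prior probabilities, D(Y) = (μ₁ − μ₂)' Σ⁻¹ (Y − (μ₁ + μ₂)/2), which may differ in its constants from the D(Y) defined earlier in this specification. The helper names and the numbers are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = s / m[r][r]
    return x


def linear_discriminant(y, mu1, mu2, sigma):
    """Assumed form: D(Y) = (mu1 - mu2)' Sigma^{-1} (Y - (mu1 + mu2)/2)."""
    diff = [a - b for a, b in zip(mu1, mu2)]
    w = solve(sigma, diff)  # Sigma^{-1} (mu1 - mu2); Sigma is symmetric
    centered = [yi - (a + b) / 2.0 for yi, a, b in zip(y, mu1, mu2)]
    return sum(wi * ci for wi, ci in zip(w, centered))


def classify(y, mu1, mu2, sigma):
    """Group 1 (Group PD) if D(Y) >= 0, otherwise group 2."""
    return 1 if linear_discriminant(y, mu1, mu2, sigma) >= 0 else 2


# Hypothetical two-biomarker example with a common covariance matrix.
mu1, mu2 = [1.0, 0.0], [0.0, 0.0]
sigma = [[1.0, 0.5], [0.5, 1.0]]
score_a = linear_discriminant([0.9, 0.1], mu1, mu2, sigma)
group_a = classify([0.9, 0.1], mu1, mu2, sigma)
group_b = classify([0.1, 0.0], mu1, mu2, sigma)
```

The same function applies to the predicted-value variants by passing the predicted values in place of the estimated means.
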

Classification based upon the "minimum" random subject effects and predicted values, $\hat{Y}_i^{(min)}$:

• If the decision $\Sigma_1 = \Sigma_2 = \Sigma$ was made in Phase II, evaluate the linear discriminant
function, D(Y) (above), substituting $\hat{Y}_i^{(min)}$ for $\mu_i$ and $\hat{\Sigma}$ for $\Sigma$. Assign the subject to
group 1 (Group PD) if D(Y) ≥ 0; otherwise assign the subject to group 2 (Group PD̄).

• If the decision $\Sigma_1 \ne \Sigma_2$ was made in Phase II, evaluate the quadratic discriminant
function, Q(Y) (above), substituting $\hat{Y}_i^{(min)}$ for $\mu_i$ and $\hat{\Sigma}_i$ for $\Sigma_i$, i = 1, 2. Assign the
subject to group 1 (Group PD) if Q(Y) ≥ 0; otherwise assign the subject to group 2
(Group PD̄).

Classification based upon the "average" random subject effects and predicted values, $\hat{Y}_i^{(avg)}$:

• If the decision $\Sigma_1 = \Sigma_2 = \Sigma$ was made in Phase II, evaluate the linear discriminant
function, D(Y) (above), substituting $\hat{Y}_i^{(avg)}$ for $\mu_i$ and $\hat{\Sigma}$ for $\Sigma$. Assign the subject to
group 1 (Group PD) if D(Y) ≥ 0; otherwise assign the subject to group 2 (Group PD̄).

• If the decision $\Sigma_1 \ne \Sigma_2$ was made in Phase II, evaluate the quadratic discriminant
function, Q(Y) (above), substituting $\hat{Y}_i^{(avg)}$ for $\mu_i$ and $\hat{\Sigma}_i$ for $\Sigma_i$, i = 1, 2. Assign the
subject to group 1 (Group PD) if Q(Y) ≥ 0; otherwise assign the subject to group 2
(Group PD̄).

After each subject in the validation sample (as defined above) is classified, compute a 2 × 2
table, similar to the following, for each of the three procedures (based on estimated values or
based upon predicted values):

Numbers of subjects in the validation sample tabulated by actual and classified membership in D.

                          Subject was classified as a member of Group:
Subject was actually
a member of Group:        PD̄                                  PD
D̄                         $N_{11}$ = Number of true            $N_{12}$ = Number of false
                          negative classifications             positive classifications
D                         $N_{21}$ = Number of false           $N_{22}$ = Number of true
                          negative classifications             positive classifications

Further, compute separately for classification based on estimated values and for classification
based upon predicted values:

$r_{FP} = N_{12}/(N_{11} + N_{12})$ = false positive error rate = proportion of false positive classifications
$r_{FN} = N_{21}/(N_{21} + N_{22})$ = false negative error rate = proportion of false negative classifications
$r_{tot} = (N_{12} + N_{21})/(N_{11} + N_{12} + N_{21} + N_{22})$ = total error rate = proportion of false classifications
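
As a short worked example, the three error rates can be computed directly from the four cell counts of the 2 × 2 table. The function name is hypothetical; the counts used are those reported in Table 8 of the example below (100, 79, 74 and 188).

```python
def error_rates(n11, n12, n21, n22):
    """Return (false positive rate, false negative rate, total error rate).

    n11: true negatives, n12: false positives,
    n21: false negatives, n22: true positives.
    """
    r_fp = n12 / (n11 + n12)
    r_fn = n21 / (n21 + n22)
    r_tot = (n12 + n21) / (n11 + n12 + n21 + n22)
    return r_fp, r_fn, r_tot


# Cell counts reported in Table 8 (discriminant based on estimated values).
r_fp, r_fn, r_tot = error_rates(100, 79, 74, 188)
```

Rounded to two decimals these are 0.44, 0.28 and 0.35, matching the percentages reported in Table 8.
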

In a typical embodiment of the subject invention, one will compare the three types of
classification procedures, i.e., the one based on estimated values, $\hat{\mu}_i$, the one based on
"minimum" predicted values, $\hat{Y}_i^{(min)}$, and the one based on "average" predicted values, $\hat{Y}_i^{(avg)}$,
to determine the "apparently most reliable procedure." Some considerations in the selection
process are:

• If a false negative classification has substantially more serious consequences than a
false positive classification, select the procedure with the smaller false negative error
rate, $r_{FN}$. This situation could arise, for example, if Group D is the subpopulation of
persons who will suffer a myocardial infarction ("MI") within a specified five-year
age group. A false negative classification, failure to warn a person of a high MI
probability, could have more serious consequences than a false positive classification,
warning a low-probability person that they have a high MI probability.

• Conversely, if a false positive classification has substantially more serious
consequences than a false negative classification, select the procedure with the smaller
false positive error rate, $r_{FP}$.

• When there is no a priori reason to assign greater seriousness to either a false negative
or a false positive classification, select the procedure with the smaller total error rate,
$r_{tot}$.

The procedure selected as the apparently most reliable procedure is used to classify subjects
into the two groups, Group PD and Group PD̄.
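
The selection considerations above amount to choosing the procedure that minimizes whichever error rate matters most. A minimal sketch follows; the function name, the procedure labels and the `priority` keywords are hypothetical, and the rates are computed from the counts reported in Tables 8 and 9 of the example below.

```python
def select_procedure(rates, priority="total"):
    """Pick the procedure with the smallest error rate of the chosen kind.

    rates: dict mapping procedure name -> (r_fp, r_fn, r_tot).
    priority: "false_positive", "false_negative", or "total".
    """
    index = {"false_positive": 0, "false_negative": 1, "total": 2}[priority]
    return min(rates, key=lambda name: rates[name][index])


# Error rates derived from the counts in Table 8 (estimated values)
# and Table 9 (predicted values, minimum random subject effect).
rates = {
    "estimated": (79 / 179, 74 / 262, 153 / 441),
    "minimum": (74 / 179, 81 / 262, 155 / 441),
}

best_if_fn_costly = select_procedure(rates, "false_negative")
best_if_fp_costly = select_procedure(rates, "false_positive")
best_overall = select_procedure(rates, "total")
```

With these counts the estimated-value procedure has the smaller false negative and total error rates, while the minimum-effect procedure has the smaller false positive rate.
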

Step 2: Use two types of logistic regression to compute estimates of the probability that a 
new subject will belong to each group. 

The data from the training sample are used to fit a logistic regression model in which the
value of the discriminant function (D(Y) if linear, Q(Y) if quadratic) for each subject is used
as the independent ("X") variable and the Biological Condition Status (indicator variable for
membership in Group D) as the dependent ("Y") variable. The model is used, together with
the inverse logistic transform, to compute for each subject an estimate of the probability that the
subject will belong to Group D.

In a separate calculation, the data from the training sample are used to fit a logistic regression
model in which the biomarkers used in the discriminant function, together with the final
mixed model covariates (variables in X), are incorporated as independent ("X") variables and
the Biological Condition Status (indicator variable for membership in Group D) as the
dependent ("Y") variable. In addition to obtaining the usual logistic regression model
estimates, the model is used, together with the inverse logistic transform, to compute for each
subject an estimated probability that the subject will belong to Group D. When longitudinal
data are used, the model is used to estimate the probability that the subject will belong to
Group D at the end of the specified period. One can use a generalized estimating equation
approach with a logistic link function to accommodate correlations among the multiple
binomial outcomes from one subject.

The predicted probabilities from these two models can provide interesting interpretations of 
discriminant function values. 
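
The first of the two logistic models (discriminant function value as the single independent variable) can be sketched as below. This is an illustrative implementation only, fitted by plain Newton-Raphson iteration on made-up data; the function names and the data are hypothetical, and a production analysis would normally use an established statistical package.

```python
import math


def inverse_logit(eta):
    """Inverse logistic transform: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


def fit_logistic(x, y, iterations=25):
    """Fit P(y = 1) = inverse_logit(a + b * x) by Newton-Raphson."""
    a = b = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = inverse_logit(a + b * xi)
            w = p * (1.0 - p)
            g0 += yi - p            # score for the intercept
            g1 += (yi - p) * xi     # score for the slope
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


# Hypothetical discriminant scores and Group D indicators (1 = Group D).
scores = [-2.0, -1.0, -1.0, 0.0, 0.0, 1.0, 1.0, 2.0]
status = [0, 0, 1, 0, 1, 0, 1, 1]

a, b = fit_logistic(scores, status)
prob_at_zero = inverse_logit(a + b * 0.0)
prob_at_two = inverse_logit(a + b * 2.0)
```

The fitted probabilities give a direct reading of a discriminant score: in this toy data set a score of 0 corresponds to a probability of about one half, and larger scores to higher probabilities.
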

While the subject algorithm is the preferred embodiment for determining the discriminant
function to be used in the subject invention, it is to be understood that this algorithm is provided solely
for the purpose of illustrating the preferred embodiment of the subject invention, and in no
case is it to be understood that the subject invention is limited to the steps or substeps of the
algorithm described herein. For example, it is to be understood that in the art and field of
discriminant analysis methodology, there are other types of discriminant functions, e.g.,
so-called "optimal discrimination," other types of regression, e.g., nonlinear mixed models, etc.,
that may also be used while falling fully within the scope and spirit of the subject invention.

This invention will now be described in detail with respect to specific representative 
embodiments thereof, the materials, apparatus and process steps being understood as 
examples that are intended to be illustrative only. In particular, the invention is not intended 
to be limited to the statistical methods, materials, conditions, process parameters, apparatus 
and the like specifically recited herein. 

AN EXAMPLE OF THE PREFERRED EMBODIMENT 

The attached tables and Figures present the results of an illustrative analysis of data using the
methods and procedures of the subject invention.

The data used as a basis for this example were obtained from a database including patients for
whom Sickle Cell data are acquired on an annual basis. Some patients have data from three
consecutive visits. However, since patients typically cannot be compelled to participate
annually, the database includes many patients for whom data are available from only one or
two annual visits. Database information that was used here included demographic data,
clinical chemistry data, and hematological data.

The specified biological condition of interest (the "disease" or "affliction") in this example
was an occurrence of a painful crisis that required hospitalization. At each annual visit the
subject is asked (and records are checked to determine) if the subject had a painful crisis that
required hospitalization in the preceding year. Each subject who reported having had a
hospitalization for a painful crisis at any visit (any year) is a member of the "Diseased" group
(Group D); all other subjects are members of Group D̄.

Whenever a subject had had a painful crisis that required hospitalization in the preceding
year, all data that were collected after the hospitalization for the painful crisis, in the same
year or in later years, were excluded from the analysis. This mimics the procedure that would
be used if the outcome were death or occurrence of a chronic, incurable disease. The variable
that records a subject's Group D membership (e.g., diseased or not, afflicted or not) is named
the "Disease Status" variable.

The following is an example of the statistical analysis procedures using the sickle cell data.
For reasons of confidentiality, the data used in this example are artificial and do not come
from a real study or from real subjects. However, the data are similar to data that could have
been obtained in a study of real subjects.

Phase I. Establish Evaluation Methodology and Select Biomarkers for Consideration.

Step 1: Select a methodology for estimating the procedure's error rates.

Step 2: Select the "training sample," i.e., the subset of the test population to be used for
statistical analyses leading to the discriminant procedure/probability estimation
procedure, and the "validation sample," which is the complementary subset.

The Training Sample/Validation Sample Method was chosen for this example.
Patients were randomly assigned to one of the two samples. The training sample was used to
create the discriminant function; the validation sample was used to evaluate the accuracy of
the discriminant function.

The training sample included information from 641 "annual" evaluations from 481 subjects,
or about 1.3 annual evaluations per subject. However, not all biomarkers were assessed, even
when a subject made a visit. For an extreme example, only 88 values of Direct Bilirubin
(variable L_DBILI) were available from only 80 subjects.

Step 3: Compile a list of Potential Biomarkers that are potential discriminators. 

In this case, blood pressures, all available demographic data, clinical chemistry data, and 
hematological data were used as potential discriminators. The Potential Biomarkers are listed 
in Table 2. 

Step 4: Initiate the set of Candidate Biomarkers by including any Potential Biomarkers that, 
on the basis of previous research and experience, are confidently believed to be 
related to the specified biological condition. 

In the example, Platelet Count (or "Platelets") was taken as a "known" biomarker for Disease
Status, hospitalization for a pain crisis.

Step 5: Add to the list of Candidate Biomarkers any Potential Biomarkers that are
"statistically significantly" correlated with the "known important" biomarkers from
Step 4.

Biomarkers were selected that were correlated with the "known important" biomarker,
Platelets, from Step 4. A summary of these correlations is shown in Table 3, in the columns
labeled "Correlation W/ Platelets". The "p" column shows the p-values for correlations with
Platelets. A biomarker was selected on the basis of a marginal p-value for the Pearson
product-moment correlation coefficient. In the example, p < 0.01 was required for selection.
The "p<cv" column indicates, by the presence of the word "YES," those biomarkers that
became Candidate Biomarkers as a result of a "significant" correlation with Platelets.
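
The correlation screen of Step 5 can be sketched as follows. This is a simplified illustration: the p-value is approximated with the Fisher z-transformation and a normal reference distribution rather than whatever exact test the original software applied, visits with a missing value are dropped pairwise, and all variable names and data are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def approx_p_value(r, n):
    """Two-sided p-value for H0: rho = 0 via the Fisher z approximation."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2.0))


def screen(known, candidates, cutoff=0.01):
    """Return names of candidates whose correlation with `known` has p < cutoff."""
    selected = []
    for name, values in candidates.items():
        pairs = [(a, b) for a, b in zip(known, values)
                 if a is not None and b is not None]
        xs, ys = zip(*pairs)
        r = pearson_r(xs, ys)
        if approx_p_value(r, len(pairs)) < cutoff:
            selected.append(name)
    return selected


# Hypothetical platelet counts and two hypothetical potential biomarkers.
platelets = [150, 200, 250, 300, 350, 400, 450, 500, 550, 600]
candidates = {
    "marker_a": [1.1, 1.9, 3.2, 3.8, 5.1, 6.2, 6.8, 8.1, 9.0, 9.9],
    "marker_b": [5, 3, 6, 2, 7, 4, 5, 3, 6, 4],
}
selected = screen(platelets, candidates)
```

Here only `marker_a` is strongly correlated with the platelet values and would be added to the Candidate Biomarker list.
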

Step 6: Fit a logistic regression model for each Potential Biomarker, using a binary
indicator variable for the specified biological condition as the dependent (Y) variable
and age and the Potential Biomarker as the independent (X) variables. Add to the list
of Candidate Biomarkers each Potential Biomarker that is "statistically significant"
in its logistic regression model.

A logistic regression model was fitted for each biomarker, using Disease Status as the
dependent (Y) variable and a combination of age and the biomarker as the independent (X)
variables. In this case, for each biomarker the logistic model assessed how well the
probability of a hospitalization for a painful crisis is described by that biomarker, in
conjunction with the subject's age. Roughly speaking, the biomarker's regression coefficient,
or slope, in the logistic regression will be approximately zero if there is no relationship
between the biomarker and the probability that the subject will acquire the specified
biological condition; a nonzero slope indicates a relationship. A summary of the logistic
regression results is shown in Table 3, in the columns headed "Logistic Regression." The "p"
column shows the p-values for the biomarker's regression coefficient. A biomarker was
selected on the basis of a marginal p-value for the biomarker's slope in the logistic regression
model. In the example, p < 0.01 was required for selection. The "p<cv" column indicates,
by the presence of the word "YES," those biomarkers that became Candidate Biomarkers as a
result of a "significant" logistic regression coefficient. Note that some of these biomarkers
were also significantly correlated with Platelets and were Candidate Biomarkers before the
logistic regressions were computed.

Step 7: Evaluate each longitudinally-assessed Potential Biomarker, using a general linear
mixed model ("MixMod") to assess whether longitudinal trends in the biomarker's
values are related to acquisition of the specified biological condition. Each Potential
Biomarker with a statistically significant longitudinal trend is moved to the list of
Candidate Biomarkers.

A mixed model was fitted for each biomarker, using longitudinal values of the biomarker as
the dependent (Y) variable, with Age, Disease Status, and Visit Number * Disease Status as
the independent (X) variables, and a subject effect in the random effects (Z) part of the model.
(Visit Number and Disease Status are "classification" variables; the corresponding
coefficients are increments to an intercept. In contrast, Age is a continuous variable whose
coefficient is a slope.) The random effects part of the mixed model incorporates the
correlations between longitudinal measurements from the same subject. The model permits
the number of visits (longitudinal assessments) to vary from subject to subject.

A biomarker could be selected if either the Disease Status "main effect" or the subvector of
three Visit Number * Disease Status interaction coefficients was statistically significantly
different from zero (p < 0.01). A significant Disease Status "main effect" would indicate that
the mean of the biomarker values for subjects in Group D is different from the mean for
subjects in Group D̄. A significant subvector of three Visit Number * Disease Status
interaction coefficients would indicate that the time trend in biomarker values for subjects in
Group D is different from the time trend for subjects in Group D̄. In either case (significant main
effect or interaction), the results would indicate that the biomarker is a potentially useful
discriminator and should be moved to the Candidate Biomarker list. The results from the
mixed models are shown in Table 3 in the columns headed "Mixed Model." Separate results
are shown for main effects and interactions, in a format similar to results from correlations
and logistic regressions.

At the end of Steps 4-7, all Potential Biomarkers have been examined and each biomarker
with historical or quantitative evidence of utility as a discriminator has been moved to the list
of Candidate Biomarkers. The Candidate Biomarkers are indicated by the word "YES" in
Table 3 in the column headed "Selected."

Phase II. Reduce the Candidate Biomarkers to a Set of Select Biomarkers that have
Discriminatory Power and Perform Mixed Model Estimation of the Covariance
Structure and Predicted Values.

Step 1: Prepare a dataset in which one variable, "RespScal," contains scaled values
(including longitudinal measures) of all Candidate Biomarkers from all subjects.

This step was executed for the example but the results are not shown. However, note that
when all the values of all the different biomarkers are placed into one column vector, Y, the
vector can contain a large number of elements.
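
A minimal sketch of how such a stacked dataset might be assembled is given below. The scaling rule is an assumption: each biomarker is standardized to mean 0 and standard deviation 1 over its observed values, which may not be the scaling actually used for RespScal. The function name, record layout and the "PLATELETS" column name are hypothetical.

```python
import statistics


def build_respscal(records, biomarkers):
    """Stack all biomarkers into long format with one scaled response column.

    records: list of dicts with 'subject', 'visit' and one key per biomarker
             (value None when the biomarker was not assessed at that visit).
    Returns rows of (subject, visit, biomarker name, scaled value).
    """
    rows = []
    for name in biomarkers:
        observed = [r[name] for r in records if r.get(name) is not None]
        mean = statistics.mean(observed)
        sd = statistics.stdev(observed)
        for r in records:
            value = r.get(name)
            if value is not None:
                # Assumed scaling: z-score within biomarker.
                rows.append((r["subject"], r["visit"], name, (value - mean) / sd))
    return rows


# Hypothetical visits for two subjects; one L_BUN value is missing.
records = [
    {"subject": 1, "visit": 1, "PLATELETS": 300.0, "L_BUN": 2.0},
    {"subject": 1, "visit": 2, "PLATELETS": 350.0, "L_BUN": None},
    {"subject": 2, "visit": 1, "PLATELETS": 400.0, "L_BUN": 2.4},
]
rows = build_respscal(records, ["PLATELETS", "L_BUN"])
```

Each observed biomarker value contributes one row, so the stacked vector grows with the number of subjects, visits and biomarkers.
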

Step 2: Fit a general linear mixed model (MixMod) using the specifications listed below;
obtain estimates of the parameter matrices $\beta$, $\Delta$, and V; obtain estimates of each
subject's random subject effects, $\hat{d}_{ik}$, and each subject's "predicted values," $\hat{Y}_{ik}^{(min)}$
and $\hat{Y}_{ik}^{(avg)}$, as if the subject were in each specified biological condition group, i = 1, 2.

Step 3: Delete the biomarker that has the least apparent discriminant power and re-fit the
mixed model.

Steps 2-3 are repeated iteratively until all biomarkers in the model are statistically significant.
In the interests of conserving space in this presentation of an example, only the final results of
the iterations through Steps 2-3 are discussed. Steps 2-3 reduced the number of biomarkers to
15, with Age as a fixed effect covariate.

General information for the example mixed model is given in Table 4. Data were available
from 481 patients with a maximum of three visits for each patient. Note the apparently large
numbers of observations not used in the analysis. Artificial observations were generated with
missing Y-values to compel the software to compute the required predicted values. The
artificial observations with missing Y-values have no impact on the estimation of parameters
or prediction of random subject effects.

Table 5 gives the estimates of the fixed effects from the mixed model. The p-value for each
biomarker (e.g., the p-value for "L_BUN") is a p-value for a test of the hypothesis that the
mean value of this biomarker is the same as the overall mean, averaged over all biomarkers.
The fact that these p-values are significant is of little interest; one expects the mean of one
biomarker's values to be different from the mean of another biomarker's values.

In Table 5 the p-value for each "biomarker X GROUP IA" interaction (e.g., the p-value for
"ALBUMIN X GROUP 1A") is a p-value for a test of whether the mean value of
the biomarker for Group D differs from the mean value of the biomarker for
Group D̄. A significant value (e.g., p < 0.05) indicates that the biomarker should be a good
discriminator. All of the interactions in the final model represented by Table 5 are
statistically significant (all p ≤ 0.05). Age was forced to remain in the model even though the
p-value is not significant.

Subject-, biomarker-, Disease Status ("Group")-, and visit-specific observed and predicted
values for subject 447 are shown in Table 6. This subject was in Group D̄ ("GROUP
D?"=NO; note "RESPSCAL" is missing for rows with "GROUP D?"=YES), but we have
predicted values for both groups. Note also that this subject had no data for biomarker MCH
or MCHC for Visit 2, but we have model-based predicted values for that subject's Visit 2
MCH and MCHC.

The strategy implemented in Steps 2-3 is an analog of a "backwards elimination" procedure
in the stepwise regression context. An alternative would be to implement an analog of
"forward selection," in which one initially includes only two (or very small numbers of)
clearly effective discriminants (biomarkers) in the model and, at each subsequent step, adds
one more biomarker.

Step 4: Determine the structures of the covariance parameter matrices, $\Delta_i$ and $V_{ik}$.

As noted above, the overall structure of $\Sigma_{ik}$ must take into account three types of
covariances/correlations:

Type ADB: Covariances/correlations among different biomarkers evaluated at the
same time point;

Type ALESB: Covariances/correlations among longitudinal evaluations of a single
biomarker;

Type BTBEL: Covariances/correlations between two biomarkers, evaluated
longitudinally, i.e., covariances/correlations between any pair of biomarkers,
one evaluated at one time and the other evaluated at a different time.

In the example the following structures were ultimately obtained:

Identical random effects covariance parameter matrices for both Group D and Group
D̄, i.e., $\Delta_1 = \Delta_2 = \Delta$, and
$\Delta$ has compound symmetric structure, $\delta_{ii} = 0.6669$, $\delta_{ij} = 0.0097$ for $i \ne j$.

Type ADB covariances in matrix V, which is the same for both Group D and Group
D̄, and compound symmetric structure, $v_{ii} = 0.3267$, $v_{ij} = 0.0151$ for $i \ne j$.

This covariance structure was reasonable given the sickle cell data at hand.

Estimates of $\Delta$ and V are shown in Table 7. The estimate of $\Delta$, the covariance matrix of the
random subject effects, is in the top of the table. The rows and columns correspond to the 15
biomarkers used in this model; the columns are labeled.

The estimate of V, the covariance matrix of the within-subject, within-visit errors, is in the
bottom of the table. As with $\Delta$, the rows and columns correspond to the 15 biomarkers used
in this model. V has compound symmetric structure, which is reasonable for the scaled data.
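
For illustration, the two compound symmetric matrices reported above can be built directly from their diagonal and off-diagonal values. The sketch also forms the covariance of the 15 scaled biomarkers at a single visit under the assumption that each biomarker has its own random subject effect, so that Z is the identity matrix and the covariance is simply the sum of the two matrices; that assumption and the function name are not taken from the specification.

```python
def compound_symmetric(size, diagonal, off_diagonal):
    """Square matrix with a common variance and a common covariance."""
    return [[diagonal if i == j else off_diagonal for j in range(size)]
            for i in range(size)]


N_BIOMARKERS = 15

# Values reported for the example's final mixed model.
delta = compound_symmetric(N_BIOMARKERS, 0.6669, 0.0097)  # random subject effects
v = compound_symmetric(N_BIOMARKERS, 0.3267, 0.0151)      # within-visit errors

# Single-visit covariance, assuming Z = I (one random effect per biomarker).
sigma_one_visit = [[delta[i][j] + v[i][j] for j in range(N_BIOMARKERS)]
                   for i in range(N_BIOMARKERS)]
```

Under that assumption each scaled biomarker has total variance 0.9936 and each pair of biomarkers at the same visit has covariance 0.0248.
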

Phase III: Calculate Discriminant Functions Using Estimated Means and Predicted
Values and Compute Logistic Predicted Values for each Subject; Estimate Error
Rates for the Discriminant Functions

Step 1: Using results from the Phase II mixed model, classify all subjects in the validation
sample and estimate the error rates of multiple candidate discriminant procedures,
one based on "estimated values," and others based upon "predicted values" utilizing
various combinations of the estimated random subject effects. The procedure with the
lowest estimated error rate is the selected procedure and is referred to as the "apparently
most reliable procedure."

The present procedures were applied using the mixed model results for the sickle cell data.
Since the covariance parameter matrices were modeled to be equal for Group D and Group D̄,
each discriminant was a linear discriminant. Each discriminant was applied to the subjects in
the training sample (used here as a validation sample), projecting each subject to belong to
either Group PD or Group PD̄.

An evaluation of the subject linear discriminant function based on estimated values is shown
in Table 8. Of 179 subjects in Group D̄, the Disease Status = "No" group, 100 (56%) were
correctly classified by the discriminant into Group PD̄ and 79 (44%) were incorrectly
classified into Group PD. Of 262 subjects in Group D, the Disease Status = "Yes" group, 188
(72%) were correctly classified into Group PD and 74 (28%) were incorrectly classified into
Group PD̄. Overall, of 441 subjects, 288 subjects (65%) were correctly classified and 35%
were misclassified.

Table 9 displays an evaluation of the subject linear discriminant function based on predicted
values using the minimum random subject effect. Table 9 is similar to Table 8. Prediction
discrimination led to a slight improvement of discrimination in Group D̄, but slightly worse
results in Group D. Overall, the error rate was approximately the same.

The classification/misclassification statistics in the preceding paragraph and in Tables 8-9 are
optimistically biased, that is, the tables provide a more favorable estimate of misclassification
rates than is likely to occur in practice, because the training sample was used both to derive
the discriminant function and to evaluate it. Evaluation of the discriminant function using the
validation sample will produce unbiased estimates of the misclassification rates. Resampling
techniques such as jackknifing or bootstrapping can produce less biased estimates while still
using data from the training sample.
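
The optimism of resubstitution error rates can be seen in a toy sketch. It uses a stand-in classifier (assign each one-dimensional score to the group whose mean score is nearest), not the mixed-model discriminant of this example, and compares the resubstitution error with a leave-one-out (jackknife-style) estimate in which each subject is classified by a rule derived without that subject. All names and data are hypothetical.

```python
def group_means(scores, labels, skip=None):
    """Mean score per group, optionally leaving out the subject at index `skip`."""
    sums, counts = {}, {}
    for j, (s, lab) in enumerate(zip(scores, labels)):
        if j != skip:
            sums[lab] = sums.get(lab, 0.0) + s
            counts[lab] = counts.get(lab, 0) + 1
    return {lab: sums[lab] / counts[lab] for lab in sums}


def nearest_mean_label(score, means):
    """Label of the group whose mean is closest to the score."""
    return min(means, key=lambda lab: abs(score - means[lab]))


def resubstitution_error(scores, labels):
    """Error rate when the rule is evaluated on the data used to derive it."""
    means = group_means(scores, labels)
    wrong = sum(nearest_mean_label(s, means) != lab
                for s, lab in zip(scores, labels))
    return wrong / len(scores)


def leave_one_out_error(scores, labels):
    """Error rate when each subject is classified without its own data."""
    wrong = 0
    for i, (s, lab) in enumerate(zip(scores, labels)):
        means = group_means(scores, labels, skip=i)
        if nearest_mean_label(s, means) != lab:
            wrong += 1
    return wrong / len(scores)


# Hypothetical discriminant scores for two small groups.
scores = [-2.0, 0.2, 0.5, 3.0]
labels = ["D", "D", "not D", "not D"]
resub = resubstitution_error(scores, labels)
loo = leave_one_out_error(scores, labels)
```

In this small example the resubstitution error is 0 while the leave-one-out error is 0.5, because the two borderline subjects are classified correctly only when their own scores help define the group means.
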

Step 2: Use two types of logistic regression to compute estimates of the probability that a 
new subject will belong to each group. 

Two types of logistic regressions are fitted to the training sample data for each of the
discriminant functions. In both logistic regressions, the Disease Status indicator is the
dependent ("Y") variable. In the first logistic regression, the value of the discriminant
function based on estimation is used as an independent ("X") variable. In the second
logistic regression, the value of the discriminant function based on prediction is used as an
independent ("X") variable. In a third logistic regression, the biomarkers used in the
discriminant function are incorporated as independent ("X") variables, along with covariates
used in the fixed effects part of the mixed model, and the Disease Status indicator is the
dependent ("Y") variable. The estimates from the logistic regression models are used to
compute, for each subject, an estimated probability that the subject belongs to the diseased
(Disease Status "Yes") group. The results of the logistic regression computations are not
displayed in tables.

Figure 1 displays the empirical distribution functions ("EDF") of the linear discriminant
function values (based on estimated values) for Group D (solid line) and Group D̄ (dashed
line). To prepare the graph, the data for the subjects are sorted by Disease Status group and,
within a group, by increasing values of D(Y). Data points are plotted in that sequence. The
EDF value starts at 0 (before the first subject's data are plotted) and increases by 1/n for each
subject, where n is the number of subjects in that group. Thus, the EDF climbs from 0 to 1,
separately for each group. In Figure 1, the fact that the EDF for Group D is shifted to the left
of the EDF for Group D̄ indicates that Group D tends to have lower scores than Group D̄.
One can see that roughly 72% of Group D subjects have D(Y) values less than 0 (the
separation point between Group PD and Group PD̄), while Group D̄ has about 44% of their
subjects' EDF values to the left of 0. The steepness of the groups' EDF lines near the vertical
line at LDF=0 indicates that many subjects are "borderline" and are difficult to classify. It is
possible that if an additional year of followup had been available, a number of subjects in
Group D̄ (in these data) would have had pain crises in the subsequent year and would have
"converted" to Group D.

The empirical distribution functions ("EDF") of the minimum random subject linear
discriminant function values for Group D (solid line) and Group D̄ (dashed line) are shown in
Figure 2. The results and interpretations are similar to those in Figure 1. However, the
groups' EDF lines are even steeper, in the vicinity of LDF=0, in Figure 2 than in Figure 1,
emphasizing the fact that many subjects are borderline.
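
The construction of an empirical distribution function described for Figures 1 and 2 can be sketched in a few lines: sort one group's discriminant values and let the EDF rise by 1/n at each value. The function names and the sample scores are hypothetical.

```python
def empirical_distribution(values):
    """Return (value, EDF) pairs: the EDF rises by 1/n at each sorted value."""
    ordered = sorted(values)
    n = len(ordered)
    return [(v, (k + 1) / n) for k, v in enumerate(ordered)]


def fraction_below(values, threshold):
    """Proportion of values strictly less than the threshold."""
    return sum(1 for v in values if v < threshold) / len(values)


# Hypothetical discriminant function values for one group.
group_scores = [0.3, -1.2, 0.8, -0.4]
edf_points = empirical_distribution(group_scores)
share_left_of_zero = fraction_below(group_scores, 0.0)
```

Plotting the points for each group separately reproduces the kind of graph described above, and the fraction of a group lying to the left of 0 is that group's EDF evaluated at the separation point.
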

These Figures reveal, as do the statistics above, that the discriminant procedures effectively
classify subjects who ultimately must be hospitalized for a pain crisis but, for the limited
data available in this example, the procedures are less effective for the subgroup who will not
be so hospitalized.

Table 2. Description of Potential Biomarkers for the Sickle Cell Data

Table 8. Evaluation of the Discriminant Procedure Using Estimated Values

Numbers of subjects in the validation sample tabulated by actual and classified membership in D.

                          Subject was classified as a member of Group:
Subject was actually
a member of Group:        PD̄ (No)                       PD (Yes)
D̄ (No)                    $N_{11}$ = 100                 $N_{12}$ = 79
                          $r_{11}$ = 56%                 $r_{12} = r_{FP}$ = 44%
D (Yes)                   $N_{21}$ = 74                  $N_{22}$ = 188
                          $r_{21} = r_{FN}$ = 28%        $r_{22}$ = 72%

$r_{tot}$ = 153/441 = 35%

Table 9. Evaluation of the Discriminant Procedure Using Predicted Values

Numbers of subjects in the validation sample tabulated by actual and classified membership in D.

                          Subject was classified as a member of Group:
Subject was actually
a member of Group:        PD̄ (No)                       PD (Yes)
D̄ (No)                    $N_{11}$ = 105                 $N_{12}$ = 74
                          $r_{11}$ = 59%                 $r_{12} = r_{FP}$ = 41%
D (Yes)                   $N_{21}$ = 81                  $N_{22}$ = 181
                          $r_{21} = r_{FN}$ = 31%        $r_{22}$ = 69%

$r_{tot}$ = 155/441 = 35%

What Is Claimed Is:

1. A computer-based system for predicting future health of individuals comprising:

(a) a computer comprising a processor containing a database of longitudinally-acquired
biomarker values from individual members of a test population, subpopulation D of said
members being identified as having acquired a specified biological condition within a specified
time period or age interval and a subpopulation D̄ being identified as not having acquired the
specified biological condition within the specified time period or age interval; and

(b) a computer program that includes steps for:

(1) selecting from said biomarkers a subset of biomarkers for discriminating
between members belonging to the subpopulations D and D̄, wherein the subset of biomarkers is
selected based on distributions of the biomarker values of the individual members of the test
population; and

(2) using the distributions of the selected biomarkers to develop a statistical
procedure that is capable of being used for:

(i) classifying members of the test population as belonging within a
subpopulation PD having a prescribed high probability of acquiring the specified biological
condition within the specified time period or age interval or as belonging within a subpopulation
PD̄ having a prescribed low probability of acquiring the specified biological condition within the
specified time period or age interval; or

(ii) estimating quantitatively, for each member of the test population, the
probability of acquiring the specified biological condition within the specified time period or age
interval.

2. The computer-based system of claim 1 wherein the statistical procedure comprises a 
discriminant function utilizing the estimated mean vectors and estimated covariance matrices of 
the distributions of biomarker values within the subpopulations D and D̄.
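By way of illustration only, a discriminant function built from estimated mean vectors and estimated covariance matrices can be sketched as follows. The sketch assumes two biomarkers, multivariate normal distributions within each subpopulation, and a prior probability of 0.5; the function names and the data values are hypothetical and are not taken from this disclosure.

```python
import math
from statistics import fmean


def fit_group(rows):
    """Estimate the mean vector and 2x2 covariance matrix of one subpopulation."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = fmean(xs), fmean(ys)
    n = len(rows) - 1  # unbiased (n - 1) divisor
    sxx = sum((x - mx) ** 2 for x in xs) / n
    syy = sum((y - my) ** 2 for y in ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in rows) / n
    return (mx, my), ((sxx, sxy), (sxy, syy))


def log_density(point, mean, cov):
    """Log of the bivariate normal density at point."""
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    (sxx, sxy), (_, syy) = cov
    det = sxx * syy - sxy * sxy
    # Squared Mahalanobis distance via the closed-form 2x2 inverse.
    maha = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return -0.5 * (2 * math.log(2 * math.pi) + math.log(det) + maha)


def prob_d(point, group_d, group_not_d, prior_d=0.5):
    """Posterior probability of membership in the first group (Bayes' rule)."""
    score_d = math.log(prior_d) + log_density(point, *group_d)
    score_not_d = math.log(1 - prior_d) + log_density(point, *group_not_d)
    # The difference of the two scores is a quadratic discriminant function.
    return 1.0 / (1.0 + math.exp(score_not_d - score_d))


# Hypothetical biomarker pairs, for illustration only.
d_rows = [(2.1, 1.2), (1.8, 0.7), (2.6, 1.1), (1.5, 1.4), (2.3, 0.6)]
not_d_rows = [(0.2, -0.1), (-0.4, 0.3), (0.5, 0.4), (-0.1, -0.5), (0.1, 0.2)]

d = fit_group(d_rows)
not_d = fit_group(not_d_rows)
p = prob_d((2.0, 1.0), d, not_d)
print(p)
```

Because each subpopulation keeps its own covariance matrix, the resulting boundary is quadratic rather than linear; pooling the two matrices would give the linear case.
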

3. The computer-based system of claim 2 wherein estimates of parameters of the distributions of 
the selected biomarkers are obtained by fitting a general linear mixed model to the biomarker
data from the test population.

4. The computer-based system of claim 2 wherein: 

(a) the estimated mean vectors are modeled as vector-valued functions of expected-value
parameters or values of covariates; or 

(b) estimated covariance matrices are modeled as matrix-valued functions of covariance 
parameters or values of covariates. 

5. The computer-based system of claim 4 wherein estimates of parameters of the distributions of 
the selected biomarkers are obtained by fitting a general linear mixed model to the biomarker 
data from the test population. 

6. The computer-based system of claim 4 wherein an estimated mean vector or probability 
incorporates an estimate of the realized value of a random subject effect vector for a member 
being classified or of a member for whom a probability is estimated. 

7. The computer-based system of claim 6 wherein estimates of parameters of the distributions of 
the selected biomarkers are obtained by fitting a general linear mixed model to the biomarker
data from the test population. 

8. A computer-based system for predicting future health of individuals comprising:

(a) a computer comprising a processor containing a database of biomarker values from 
individual members of a test population, subpopulation D of said members being identified as 
having acquired a specified biological condition within a specified time period or age interval 
and a subpopulation D̄ being identified as not having acquired the specified biological condition
within the specified time period or age interval; and 

(b) a computer program that includes steps for:

(1) selecting from said biomarkers a subset of biomarkers for discriminating
between members belonging to the subpopulations D and D̄, wherein the subset of biomarkers is
selected based on distributions of the biomarker values of the individual members of the test
population; and 

(2) using the distributions of the selected biomarkers to develop a statistical 
procedure that is capable of being used for: 

(i) classifying members of the test population as belonging within a 
subpopulation PD having a prescribed high probability of acquiring the specified biological 
condition within the specified time period or age interval or as belonging within a subpopulation 
PD̄ having a prescribed low probability of acquiring the specified biological condition within the
specified time period or age interval; or 

(ii) estimating quantitatively, for each member of the test population, the 
probability of acquiring the specified biological condition within the specified time period or age
interval; 

wherein the statistical procedure comprises a discriminant function utilizing the estimated 
mean vectors and estimated covariance matrices of the distributions of biomarker values within 
the subpopulations D and D̄.

9. The computer-based system of claim 8 wherein estimates of parameters of the distributions of 
the selected biomarkers are obtained by fitting a general linear mixed model to the biomarker 
data from the test population. 

10. The computer-based system of claim 9 wherein: 

(a) the estimated mean vectors are modeled as vector-valued functions of expected-value 
parameters or values of covariates; or 

(b) estimated covariance matrices are modeled as matrix-valued functions of covariance 
parameters or values of covariates. 

11. The computer-based system of claim 10 wherein an estimated mean vector or probability
incorporates an estimate of the realized value of a random subject effect vector for a member
being classified or of a member for whom a probability is estimated.

12. A method of predicting an individual's health comprising:

collecting a plurality of biomarker values from an individual, wherein at least one of said 
biomarker values is obtained by physically measuring the biomarker value; and 

applying a statistical procedure to said plurality of biomarker values so as:

(i) to classify said individual as having a prescribed high probability of acquiring 
a specified biological condition within a specified time period or age interval or as having a 
prescribed low probability of acquiring the specified biological condition within the specified 
time period or age interval; or

(ii) to estimate quantitatively for said individual the probability of acquiring the 
specified biological condition within the specified time period or age interval; 

wherein said statistical procedure is based on:

(1) collecting a database of longitudinally-acquired biomarker values from individual
members of a test population, subpopulation D of said members being identified as having
acquired the specified biological condition within the specified time period or age interval and a
subpopulation D̄ being identified as not having acquired the specified biological condition within
the specified time period or age interval; 

(2) selecting from said biomarkers a subset of biomarkers for discriminating between 
members belonging to the subpopulations D and D̄, wherein the subset of biomarkers is selected
based on distributions of the biomarker values of the individual members of the test population; 
and 

(3) using the distributions of the selected biomarkers to develop said statistical 
procedure. 

13. The method according to claim 12 wherein at least one of said biomarker values is obtained 
from a biological sample. 

14. The method according to claim 13 wherein said biological sample is a serum or urine
sample.

15. A computer-based system for predicting an individual's future health comprising:

(a) a computer comprising a processor containing a plurality of biomarker values from an 
individual; and 

(b) a computer program that includes steps for applying a statistical procedure to said
plurality of biomarker values so as: 

(i) to classify said individual as having a prescribed high probability of acquiring 
a specified biological condition within a specified time period or age interval or as having a 
prescribed low probability of acquiring the specified biological condition within the specified 
time period or age interval; or 

(ii) to estimate quantitatively for said individual the probability of acquiring the 
specified biological condition within the specified time period or age interval; 

wherein said statistical procedure is based on:

(1) collecting a database of longitudinally-acquired biomarker values from individual
members of a test population, subpopulation D of said members being identified as having
acquired the specified biological condition within the specified time period or age interval and a
subpopulation D̄ being identified as not having acquired the specified biological condition within
the specified time period or age interval; 

(2) selecting from said biomarkers a subset of biomarkers for discriminating between 
members belonging to the subpopulations D and D̄, wherein the subset of biomarkers is selected
based on distributions of the biomarker values of the individual members of the test population; 
and 

(3) using the distributions of the selected biomarkers to develop said statistical 
procedure. 

16. The computer-based system of claim 15 wherein the plurality of biomarker values from said 
individual includes longitudinally-acquired biomarker values. 

17. The computer-based system of claim 15 wherein the specified biological condition is death 
due to a specified underlying cause of death within the specified time period or age interval. 

WO 98/35609 PCT/US98/02433 

18. The computer-based system of claim 15 wherein the specified biological condition is a
specified morbidity within the specified time period or age interval. 

19. The computer-based system of claim 15 wherein the specified time period is a period of at 
least two years.

20. The computer-based system of claim 15 wherein the specified time period is a period of at 
least three years. 

21. A method for assessing an individual's future risk of death from specified underlying causes
of death comprising: 

collecting a plurality of biomarker values from an individual, wherein at least one of said 
biomarker values is obtained by physically measuring the biomarker value; and 

applying a statistical procedure to said plurality of biomarker values so as to determine 
whether said individual is classified as having a prescribed high probability of dying, within a
specified time period or age interval, from any one of the underlying causes of death that account
in the aggregate for at least 60% of all deaths in a test population over the specified time period 
or age interval. 
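
The 60% criterion recited above can be illustrated with a short sketch. It assumes, as one possible reading that the claim language does not require, that the set of causes is assembled by taking the leading causes in descending order of frequency until their combined share of deaths reaches the threshold. The function name, cause labels and counts are hypothetical.

```python
def causes_covering(deaths_by_cause, threshold=0.60):
    """Return leading causes whose combined share of deaths reaches threshold."""
    total = sum(deaths_by_cause.values())
    if total == 0:
        raise ValueError("no deaths recorded in the test population")
    selected, covered = [], 0
    ranked = sorted(deaths_by_cause.items(), key=lambda kv: kv[1], reverse=True)
    for cause, count in ranked:
        if covered >= threshold * total:
            break  # the causes chosen so far already meet the threshold
        selected.append(cause)
        covered += count
    return selected, covered / total


# Hypothetical death counts by underlying cause, for illustration only.
counts = {"cause_a": 40, "cause_b": 25, "cause_c": 20, "cause_d": 15}
selected, share = causes_covering(counts)
print(selected, share)
```

With these counts the two leading causes together account for 65 of 100 deaths, so the remaining causes are not needed to meet the threshold.
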

22. A method for assessing an individual's evidence of good health comprising:

collecting a plurality of biomarker values from an individual, wherein at least one of said 
biomarker values is obtained by physically measuring the biomarker value; and 

applying a statistical procedure to said plurality of biomarker values so as to determine 
whether said individual is classified as having a prescribed high probability of not dying, within 
a specified time period or age interval, from any one of the underlying causes of death that
account in the aggregate for at least 60% of all deaths in a test population over the specified time
period or age interval. 



23. A computer-based system for assessing an individual's future risk of death from a specified
underlying cause of death comprising:

(a) a computer comprising a processor containing a plurality of biomarker values from an 
individual; and 

(b) a computer program that includes steps for applying a statistical procedure to said
plurality of biomarker values so as to determine whether said individual is classified as having a 
prescribed high probability of dying, within a specified time period or age interval, from any one 
of the underlying causes of death that account in the aggregate for at least 60% of all deaths in a 
test population over the specified time period or age interval. 

24. A computer-based system for assessing an individual's evidence of good health comprising: 

(a) a computer comprising a processor containing a plurality of biomarker values from an 
individual; and 

(b) a computer program that includes steps for applying a statistical procedure to said
plurality of biomarker values so as to determine whether said individual is classified as having a 
prescribed high probability of not dying, within a specified time period or age interval, from any 
one of the underlying causes of death that account in the aggregate for at least 60% of all deaths 
in a test population over the specified time period or age interval.

25. An apparatus for assessing an individual's risk of future health problems comprising: 

(a) a storage device for storing a plurality of biomarker values from an individual; and 

(b) a processor coupled to the storage device and programmed: 

1) to receive from the storage device said plurality of biomarker values; and 

2) to apply a statistical procedure to said plurality of biomarker values so as: 

(i) to classify said individual as belonging within a subpopulation PD 
having a prescribed high probability of acquiring a specified biological condition within a 
specified time period or age interval or as belonging within a subpopulation PD̄ having a
prescribed low probability of acquiring the specified biological condition within the specified 
time period or age interval; or 

(ii) to estimate quantitatively the probability for said individual acquiring
the specified biological condition within the specified time period or age interval;

wherein said statistical procedure is based on:

(1) collecting a database of longitudinally-acquired biomarker values from individual 
members of a test population, subpopulation D of said members being identified as having
acquired the specified biological condition within the specified time period or age interval and a
subpopulation D̄ being identified as not having acquired the specified biological condition within
the specified time period or age interval; 

(2) selecting from said biomarkers a subset of biomarkers for discriminating between
members belonging to the subpopulations D and D̄, wherein the subset of biomarkers is selected
based on distributions of the biomarker values of the individual members of the test population;
and 

(3) using the distributions of the selected biomarkers to develop said statistical 
procedure. 
